Machine learning of context of data fields for various document types

ABSTRACT

A method and system learn new forms to be incorporated into an electronic document preparation system. The method and system receive form data related to a new form having a plurality of data fields that expect data values based on specific functions. The method and system gather training set data including previously filled forms having completed data fields corresponding to the data fields of the new form. The method and system utilize machine learning in conjunction with the training set data to identify the correct function for each of the data fields of the new form.

RELATED CASES

This application is a Utility application depending from the U.S. provisional patent application filed Jul. 15, 2016 having attorney docket number INTU169813, Ser. No. 62/362,688, and entitled “SYSTEM AND METHOD FOR MACHINE LEARNING OF CONTEXT OF LINE INSTRUCTIONS FOR VARIOUS DOCUMENT TYPES,” which is hereby incorporated herein by reference in its entirety as if the contents were presented herein directly.

BACKGROUND

Many people use electronic document preparation systems to help prepare important documents electronically. For example, each year millions of people use electronic tax return preparation systems to help prepare and file their tax returns. Typically, electronic tax return preparation systems receive tax related information from a user and then automatically populate the various fields in electronic versions of government tax forms. Electronic tax return preparation systems represent a potentially flexible, highly accessible, and affordable source of tax return preparation assistance for customers. However, the processes that enable the electronic tax return preparation systems to incorporate new tax forms into the tax return preparation systems often utilize large amounts of human and computing resources.

For instance, due to changes in tax laws, or due to updates in government tax forms, tax forms can change from year to year, or even multiple times in the same year. If a tax form changes, or a new tax form is introduced, it can be very difficult to efficiently update the electronic tax return preparation system to correctly populate the various fields of the tax forms with the requested values. For example, a particular line of a newly adjusted tax form may request an input according to a function that requires values from other lines of the tax form and/or values from other tax forms or worksheets. These functions range from very simple to very complex. Updating the electronic tax return preparation system often includes utilizing a combination of tax experts, software and system engineers, and large amounts of computing resources to incorporate the new form into the electronic tax return preparation system. This can lead to delays in releasing an updated version of the electronic tax return preparation system as well as considerable expenses. These expenses are then passed on to customers of the electronic tax return preparation system, as are the delays. Furthermore, these processes for updating electronic tax returns can introduce inaccuracies into the tax return preparation system.

These expenses, delays, and possible inaccuracies can have an adverse impact on traditional electronic tax return preparation systems. Customers may lose confidence in the electronic tax return preparation systems. Furthermore, customers may simply decide to utilize less expensive options for preparing their taxes.

These issues and drawbacks are not limited to electronic tax return preparation systems. Any electronic document preparation system that assists users to electronically fill out forms or prepare documents can suffer from these drawbacks when the forms are updated or new forms are released.

What is needed is a method and system that efficiently and accurately incorporates updated forms into an electronic document preparation system.

SUMMARY

Embodiments of the present disclosure address some of the shortcomings associated with traditional electronic document preparation systems by providing methods and systems for incorporating new or updated forms by utilizing machine learning in conjunction with training set data. In particular, embodiments of the present disclosure receive form data related to a new form that includes data fields to be completed in accordance with specific functions designated by the new form. Embodiments of the present disclosure determine one or more possible dependencies for each data field. Embodiments of the present disclosure utilize machine learning to quickly and accurately determine the correct function needed to complete each data field of the form. Embodiments of the present disclosure gather training set data that includes previously filled forms related to the new form in order to assist in the machine learning process. The machine learning process for learning and incorporating the new form includes generating candidate functions for each data field of the new form based on the possible dependencies. The candidate functions can include one or more operators selected from a set or superset of operators. The machine learning process applies the candidate functions to the training set data in order to determine the accuracy of the candidate functions. For each data field, embodiments of the present disclosure generate and apply candidate functions in successive iterations until a candidate function is found that produces test data that matches the data values in the corresponding completed data fields of the previously filled forms of the training set data within a threshold level of error. Embodiments of the present disclosure then output results data that indicates that the correct function for a particular data field has possibly been found. 
This process is repeated for each selected data field of the new form until all selected data fields of the new form have been learned and incorporated. In this way, embodiments of the present disclosure provide a more reliable electronic document preparation system that quickly, efficiently, and reliably learns and incorporates new forms.
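The iterative process summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the field names (`line1`, `line2`, `line3`), the three-operator set, the exhaustive pairwise generation strategy, and the helper names `generate_candidates`, `error_rate`, and `learn_field` are all hypothetical.

```python
from itertools import permutations
import operator

# Hypothetical training set: each dict stands for one previously filled form.
TRAINING_SET = [
    {"line1": 100.0, "line2": 40.0, "line3": 60.0},
    {"line1": 250.0, "line2": 50.0, "line3": 200.0},
    {"line1": 80.0, "line2": 80.0, "line3": 0.0},
]

# Assumed (very small) set of operators to draw candidate functions from.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_candidates(dependencies):
    """Yield (description, function) pairs over ordered pairs of dependencies."""
    for a, b in permutations(dependencies, 2):
        for symbol, op in OPERATORS.items():
            # Default arguments bind the current a, b, and op to this lambda.
            yield f"{a} {symbol} {b}", (
                lambda form, a=a, b=b, op=op: op(form[a], form[b])
            )


def error_rate(candidate, target_field, forms, tolerance=0.01):
    """Fraction of forms whose test value misses the recorded value."""
    misses = sum(
        1 for form in forms
        if abs(candidate(form) - form[target_field]) > tolerance
    )
    return misses / len(forms)


def learn_field(target_field, dependencies, forms, max_error=0.0):
    """Try candidates in turn until one matches within the error threshold."""
    for description, candidate in generate_candidates(dependencies):
        if error_rate(candidate, target_field, forms) <= max_error:
            return description
    return None  # no candidate met the threshold


learned = learn_field("line3", ["line1", "line2"], TRAINING_SET)
```

With this toy data, the search settles on the candidate described as `line1 - line2`. A practical system would presumably draw on a far larger operator set and a smarter generation strategy than exhaustive enumeration.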

In one embodiment, the dependencies for a given data field of the new form can include data values from one or more other data fields of the new form. In one embodiment, the dependencies for a given data field of the new form can include data values from other data fields of one or more other forms, worksheets, or other locations. In one embodiment, the dependencies can include one or more constants. A list of possible dependencies for a given data field of the new form can be provided to the system by a natural language parsing process, an analysis of software instructions related to previous versions of electronic document preparation systems, by experts, and/or in other suitable ways.
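One way to picture the dependency list is as a small set of typed references that can be resolved against a filing. The sketch below is an assumption made for illustration only; the class names `FieldRef` and `Constant`, the `resolve` helper, and the form and field identifiers are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class FieldRef:
    """A data field on the new form or on another form or worksheet."""
    form: str   # hypothetical form identifier
    field: str  # hypothetical field identifier


@dataclass(frozen=True)
class Constant:
    """A fixed value that a function may need, such as a rate."""
    value: float


Dependency = Union[FieldRef, Constant]


def resolve(dependency, filing):
    """Return the value of a dependency for one filing.

    `filing` maps a form identifier to a dict of field values.
    """
    if isinstance(dependency, Constant):
        return dependency.value
    return filing[dependency.form][dependency.field]


# Possible dependencies for one data field (all names are made up).
possible_dependencies = [
    FieldRef("new_form", "line1"),      # another field of the new form
    FieldRef("worksheet_a", "line7"),   # a field of a different form
    Constant(0.25),                     # a constant
]

filing = {
    "new_form": {"line1": 500.0},
    "worksheet_a": {"line7": 120.0},
}
values = [resolve(dep, filing) for dep in possible_dependencies]
```

Keeping constants and field references behind one `resolve` call lets a candidate function treat every dependency uniformly, regardless of where the list came from.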

In one embodiment, the correct function for a given data field of the new form can include operators that operate on one or more of the dependencies in a particular manner. The operators can include arithmetic operators such as addition, subtraction, multiplication, or division operators. The operators can include exponential functions. The operators can include logical operators such as if-then operators. The operators can include existence condition operators that depend on the existence of a data value in another data field of the new form, in a form other than the new form, or in some other location or data set. The operators can include string comparisons. The operators can include rounding or truncating operations.
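The operator categories listed above could be collected into a registry from which candidate functions are composed. The following is a hedged sketch: the registry layout, the operator names, the round-half-up convention, and the sample rule in `example_function` (including the fields `line4`, `line5`, and `status`) are assumptions introduced for illustration.

```python
import math


def round_half_up(value):
    """Round to the nearest whole number, with halves rounding up (assumed)."""
    return math.floor(value + 0.5)


# Hypothetical operator registry covering the categories named in the text.
OPERATORS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "power": lambda a, b: a ** b,
    "if_then": lambda condition, then_value, else_value: (
        then_value if condition else else_value
    ),
    "exists": lambda form, field: form.get(field) is not None,
    "equals_text": lambda a, b: a.strip().lower() == b.strip().lower(),
    "round": round_half_up,
    "truncate": math.trunc,
}


def example_function(form):
    """Made-up rule: if line5 has a value and the status is 'single',
    return half of line4 rounded to a whole number; otherwise return 0."""
    op = OPERATORS
    applies = op["exists"](form, "line5") and op["equals_text"](
        form["status"], "single"
    )
    half = op["round"](op["multiply"](form["line4"], 0.5))
    return op["if_then"](applies, half, 0)


with_line5 = example_function({"line4": 101.0, "line5": 1, "status": "Single"})
without_line5 = example_function({"line4": 101.0, "status": "single"})
```

Here `with_line5` evaluates to 51 and `without_line5` to 0. Note that the `if_then` entry evaluates both branches eagerly, which is acceptable for a sketch but would need care if a branch could raise an error such as division by zero.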

In one embodiment, the machine learning process is able to generate and test thousands of candidate functions very rapidly in successive iterations. The machine learning process can utilize one or more algorithms to generate candidate functions based on the one or more possible dependencies and/or other factors. The machine learning process can generate new candidate functions based on previously tested candidate functions that trended toward being a better match for the training set data.
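The disclosure does not name the algorithm used to derive new candidates from earlier ones. As one possible illustration, the sketch below substitutes a simple beam search: each iteration keeps the few candidates with the lowest total error and extends them by one signed term. The data, the field names, the signed-sum representation of a candidate, and the function names are all hypothetical.

```python
# Hypothetical filled forms; "target" is the data field being learned.
FORMS = [
    {"a": 10, "b": 5, "c": 3, "target": 12},
    {"a": 20, "b": 1, "c": 7, "target": 14},
    {"a": 7, "b": 7, "c": 4, "target": 10},
]
FIELDS = ["a", "b", "c"]  # assumed possible dependencies


def evaluate(terms, form):
    """A candidate is a tuple of (sign, field) terms that are summed."""
    return sum(sign * form[field] for sign, field in terms)


def total_error(terms, forms):
    """Sum of absolute differences between test values and recorded values."""
    return sum(abs(evaluate(terms, form) - form["target"]) for form in forms)


def extend(terms):
    """Derive new candidates by adding one unused field with either sign."""
    used = {field for _, field in terms}
    for field in FIELDS:
        if field not in used:
            for sign in (1, -1):
                yield terms + ((sign, field),)


def search(forms, beam_width=2, max_iterations=3):
    """Keep the best candidates each round and build on them."""
    beam = [()]
    for _ in range(max_iterations):
        candidates = [new for terms in beam for new in extend(terms)]
        if not candidates:
            break
        candidates.sort(key=lambda terms: total_error(terms, forms))
        if total_error(candidates[0], forms) == 0:
            return candidates[0]
        beam = candidates[:beam_width]
    return None


best = search(FORMS)
```

On this toy data the search ends with a candidate equivalent to `a + b - c`. The beam width trades coverage for speed: a wider beam tests more candidates per iteration but is less likely to discard a partial function that would later turn out to be correct.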

In one embodiment, the machine learning process can generate and test a selected number of candidate functions and then generate results data that indicates how closely the candidate functions match the training set data. The machine learning process can stop and await input from an expert or other personnel indicating that a correct function has been found or that further candidate functions should be generated and tested. The results data can indicate candidate functions that are likely correct based on the matching data. Additionally, or alternatively, the results data can indicate only a certain number of the candidate functions that best matched the training set data. Additionally, or alternatively, the results data can indicate the results from all the candidate functions that were tested.
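The three reporting options described above (likely-correct candidates, a fixed number of best matches, or every tested candidate) could be assembled as follows. This is a sketch under assumptions: the `CandidateResult` record, the `build_results` helper, the 0.95 threshold, and the sample match rates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateResult:
    description: str   # human-readable form of the candidate function
    match_rate: float  # fraction of training forms the candidate matched


def build_results(results, likely_threshold=0.95, top_n=3):
    """Assemble results data for review by an expert (assumed layout)."""
    ranked = sorted(results, key=lambda r: r.match_rate, reverse=True)
    return {
        # Candidates that are likely correct based on the matching data.
        "likely_correct": [
            r.description for r in ranked if r.match_rate >= likely_threshold
        ],
        # Only a certain number of the best-matching candidates.
        "best": [r.description for r in ranked[:top_n]],
        # Results from all candidates that were tested.
        "all": [(r.description, r.match_rate) for r in ranked],
    }


tested = [
    CandidateResult("line1 + line2", 0.10),
    CandidateResult("line1 - line2", 0.98),
    CandidateResult("line1 * 0.5", 0.40),
    CandidateResult("line2 - line1", 0.05),
]
results_data = build_results(tested, top_n=2)
```

An expert reviewing `results_data` could then confirm the single likely-correct candidate or ask the process to generate and test further candidates.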

In one embodiment, the electronic document preparation system includes an electronic tax return preparation system. When a state or federal government introduces a new or updated tax form, the tax return preparation system utilizes machine learning in conjunction with training set data that includes historical tax related data related to previously prepared tax returns in order to quickly and efficiently learn and incorporate the new or updated tax form into the tax return preparation system. The tax return preparation system generates, for each data field of the new or updated tax form, a plurality of candidate functions in order to find the correct function that provides the data requested for the data field. The tax return preparation system applies the candidate functions to the historical tax related data in order to find a correct function that provides data values that match the data values in the completed data fields of the historical tax return data. The historical tax return data can include historical tax returns that have been prepared and filed with a state or federal government. The historical tax return data can include historical tax returns that have been accepted by a state or federal government agency or otherwise validated.

In some cases, it may not be feasible to obtain relevant historical tax related data related to previously filed tax returns to assist in the machine learning process of a new tax form. In these cases, the training set data can include fabricated tax returns completed by professionals or other tax return preparation systems using real or fabricated financial data.

In one example related to learning the correct function for a single data field of a new tax form, the tax return preparation system generates a candidate function for a specific line of a new tax form. The tax return preparation system generates test data by applying the candidate function to the historical tax return data. In particular, the tax return preparation system applies the candidate function to the tax related data associated with each of a plurality of previously filled tax forms that are related to the new tax form. The test data includes a test value for the specific line for each of the previously filled forms. The tax return preparation system generates matching data that indicates the degree to which the test values match the actual data values in the specific line of each of the historical tax returns. If the test data matches the actual data values in the specific line of the historical tax returns beyond a threshold degree of accuracy, then the tax return preparation system concludes that the candidate function is correct or likely correct. The tax return preparation system generates results data indicating whether the candidate function is likely correct.
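The single-line example above can be illustrated with a short sketch that produces a test value per historical return, records whether each matches, and compares the overall match rate to a threshold. The line numbers, the sample values, the 0.5 tolerance, and the 0.75 threshold are assumptions chosen for illustration; the fourth sample return deliberately disagrees with the candidate to show that a threshold below 100% tolerates some noise in historical data.

```python
def matching_data(candidate, line, returns, tolerance=0.5):
    """Apply a candidate to each return and record how the test value
    compares with the value actually recorded on the given line."""
    records = []
    for index, tax_return in enumerate(returns):
        test_value = candidate(tax_return)
        actual = tax_return[line]
        records.append({
            "return": index,
            "test": test_value,
            "actual": actual,
            "match": abs(test_value - actual) <= tolerance,
        })
    match_rate = sum(r["match"] for r in records) / len(records)
    return records, match_rate


# Hypothetical historical returns; line12 is the line being learned.
HISTORICAL = [
    {"line10": 1000.0, "line11": 200.0, "line12": 800.0},
    {"line10": 500.0, "line11": 500.0, "line12": 0.0},
    {"line10": 750.0, "line11": 125.0, "line12": 625.0},
    {"line10": 300.0, "line11": 100.0, "line12": 250.0},  # disagrees
]


def candidate(tax_return):
    """Candidate function under test: line10 minus line11."""
    return tax_return["line10"] - tax_return["line11"]


records, rate = matching_data(candidate, "line12", HISTORICAL)
likely_correct = rate >= 0.75  # assumed threshold degree of accuracy
```

In this run three of the four returns match, so the candidate is reported as likely correct under the assumed threshold, and the per-return records show a reviewer exactly which return disagreed.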

In one embodiment, the electronic document preparation system can include a financial document preparation system other than a tax return preparation system. The financial document preparation system can include an invoice preparation system, a receipt preparation system, a payroll document preparation system, or any other type of electronic document preparation system. Furthermore, principles of the present disclosure are not limited to financial document preparation systems but can extend to other types of electronic document preparation systems that assist users in filling out forms or other types of documents.

Embodiments of the present disclosure address some of the shortcomings associated with traditional electronic document preparation systems that do not adequately and efficiently incorporate new forms. An electronic document preparation system in accordance with one or more embodiments provides efficient and reliable incorporation of new forms by utilizing machine learning in conjunction with training set data in order to quickly and accurately incorporate and learn new forms. The various embodiments of the disclosure can be implemented to improve the technical fields of data processing, resource management, data collection, and user experience. Therefore, the various described embodiments of the disclosure and their associated benefits amount to significantly more than an abstract idea. In particular, by utilizing machine learning to learn and incorporate new forms in an electronic document preparation system, users can save money and time and can better manage their finances.

The disclosed embodiments provide a method and system that learn and incorporate new forms in an electronic document preparation system more accurately. Therefore, the disclosed embodiments provide a technical solution to the long standing technical problem of efficiently learning and incorporating new forms in an electronic document preparation system.

In addition, the disclosed embodiments of a method and system for learning and incorporating new forms in an electronic document preparation system are also capable of dynamically adapting to constantly changing fields such as tax return preparation and other kinds of document preparation. Consequently, the disclosed embodiments of a method and system for learning and incorporating new forms in an electronic document preparation system also provide a technical solution to the long standing technical problem of static and inflexible electronic document preparation systems.

The result is a much more accurate, adaptable, and robust method and system for learning and incorporating new forms in an electronic document preparation system, which thereby serves to bolster confidence in electronic document preparation systems. This, in turn, results in: fewer human and processor resources being dedicated to analyzing new forms because more accurate and efficient analysis methods can be implemented, i.e., fewer processing and memory storage assets; less memory and storage bandwidth being dedicated to buffering and storing data; and less communication bandwidth being utilized to transmit data for analysis.

The disclosed method and system for learning and incorporating new forms in an electronic document preparation system does not encompass, embody, or preclude other forms of innovation in the area of electronic document preparation systems. In addition, the disclosed method and system for learning and incorporating new forms in an electronic document preparation system is not related to any fundamental economic practice, fundamental data processing practice, mental steps, or pen and paper based solutions, and is, in fact, directed to providing solutions to new and existing problems associated with electronic document preparation systems. Consequently, the disclosed method and system for learning and incorporating new forms in an electronic document preparation system does not encompass, and is not merely, an abstract idea or concept.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of software architecture for learning and incorporating new forms in an electronic document preparation system, in accordance with one embodiment.

FIG. 2 is a block diagram of a process for learning and incorporating new forms in an electronic document preparation system, in accordance with one embodiment.

FIG. 3 is a flow diagram of a process for learning and incorporating new forms in an electronic document preparation system, in accordance with one embodiment.

Common reference numerals are used throughout the FIG.s and the detailed description to indicate like elements. One skilled in the art will readily recognize that the above FIG.s are examples and that other architectures, modes of operation, orders of operation, and elements/functions can be provided and implemented without departing from the characteristics and features of the invention, as set forth in the claims.

DETAILED DESCRIPTION

Embodiments will now be discussed with reference to the accompanying FIG.s, which depict one or more exemplary embodiments. Embodiments may be implemented in many different forms and should not be construed as limited to the embodiments set forth herein, shown in the FIG.s, and/or described below. Rather, these exemplary embodiments are provided to allow a complete disclosure that conveys the principles of the invention, as set forth in the claims, to those of skill in the art.

Herein, the term “production environment” includes the various components, or assets, used to deploy, implement, access, and use, a given application as that application is intended to be used. In various embodiments, production environments include multiple assets that are combined, communicatively coupled, virtually and/or physically connected, and/or associated with one another, to provide the production environment implementing the application.

As specific illustrative examples, the assets making up a given production environment can include, but are not limited to, one or more computing environments used to implement the application in the production environment such as a data center, a cloud computing environment, a dedicated hosting environment, and/or one or more other computing environments in which one or more assets used by the application in the production environment are implemented; one or more computing systems or computing entities used to implement the application in the production environment; one or more virtual assets used to implement the application in the production environment; one or more supervisory or control systems, such as hypervisors, or other monitoring and management systems, used to monitor and control assets and/or components of the production environment; one or more communications channels for sending and receiving data used to implement the application in the production environment; one or more access control systems for limiting access to various components of the production environment, such as firewalls and gateways; one or more traffic and/or routing systems used to direct, control, and/or buffer, data traffic to components of the production environment, such as routers and switches; one or more communications endpoint proxy systems used to buffer, process, and/or direct data traffic, such as load balancers or buffers; one or more secure communication protocols and/or endpoints used to encrypt/decrypt data, such as Secure Sockets Layer (SSL) protocols, used to implement the application in the production environment; one or more databases used to store data in the production environment; one or more internal or external services used to implement the application in the production environment; one or more backend systems, such as backend servers or other hardware used to process data and implement the application in the production environment; one or more software systems used to implement the application in the production environment; and/or any other assets/components making up an actual production environment in which an application is deployed, implemented, accessed, and run, e.g., operated, as discussed herein, and/or as known in the art at the time of filing, and/or as developed after the time of filing.

As used herein, the terms “computing system”, “computing device”, and “computing entity”, include, but are not limited to, a virtual asset; a server computing system; a workstation; a desktop computing system; a mobile computing system, including, but not limited to, smart phones, portable devices, and/or devices worn or carried by a user; a database system or storage cluster; a switching system; a router; any hardware system; any communications system; any form of proxy system; a gateway system; a firewall system; a load balancing system; or any device, subsystem, or mechanism that includes components that can execute all, or part, of any one of the processes and/or operations as described herein.

In addition, as used herein, the terms computing system and computing entity, can denote, but are not limited to, systems made up of multiple: virtual assets; server computing systems; workstations; desktop computing systems; mobile computing systems; database systems or storage clusters; switching systems; routers; hardware systems; communications systems; proxy systems; gateway systems; firewall systems; load balancing systems; or any devices that can be used to perform the processes and/or operations as described herein.

As used herein, the term “computing environment” includes, but is not limited to, a logical or physical grouping of connected or networked computing systems and/or virtual assets using the same infrastructure and systems such as, but not limited to, hardware systems, software systems, and networking/communications systems. Typically, computing environments are either known environments, e.g., “trusted” environments, or unknown, e.g., “untrusted” environments. Typically, trusted computing environments are those where the assets, infrastructure, communication and networking systems, and security systems associated with the computing systems and/or virtual assets making up the trusted computing environment, are either under the control of, or known to, a party.

In various embodiments, each computing environment includes allocated assets and virtual assets associated with, and controlled or used to create, and/or deploy, and/or operate an application.

In various embodiments, one or more cloud computing environments are used to create, and/or deploy, and/or operate an application that can be any form of cloud computing environment, such as, but not limited to, a public cloud; a private cloud; a virtual private network (VPN); a subnet; a Virtual Private Cloud (VPC); a sub-net or any security/communications grouping; or any other cloud-based infrastructure, sub-structure, or architecture, as discussed herein, and/or as known in the art at the time of filing, and/or as developed after the time of filing.

In many cases, a given application or service may utilize, and interface with, multiple cloud computing environments, such as multiple VPCs, in the course of being created, and/or deployed, and/or operated.

As used herein, the term “virtual asset” includes any virtualized entity or resource, and/or virtualized part of an actual, or “bare metal” entity. In various embodiments, the virtual assets can be, but are not limited to, virtual machines, virtual servers, and instances implemented in a cloud computing environment; databases associated with a cloud computing environment, and/or implemented in a cloud computing environment; services associated with, and/or delivered through, a cloud computing environment; communications systems used with, part of, or provided through, a cloud computing environment; and/or any other virtualized assets and/or sub-systems of “bare metal” physical devices such as mobile devices, remote sensors, laptops, desktops, point-of-sale devices, etc., located within a data center, within a cloud computing environment, and/or any other physical or logical location, as discussed herein, and/or as known/available in the art at the time of filing, and/or as developed/made available after the time of filing.

In various embodiments, any, or all, of the assets making up a given production environment discussed herein, and/or as known in the art at the time of filing, and/or as developed after the time of filing, can be implemented as one or more virtual assets.

In one embodiment, two or more assets, such as computing systems and/or virtual assets, and/or two or more computing environments, are connected by one or more communications channels including but not limited to, Secure Sockets Layer communications channels and various other secure communications channels, and/or distributed computing system networks, such as, but not limited to: a public cloud; a private cloud; a virtual private network (VPN); a subnet; any general network, communications network, or general network/communications network system; a combination of different network types; a public network; a private network; a satellite network; a cable network; or any other network capable of allowing communication between two or more assets, computing systems, and/or virtual assets, as discussed herein, and/or available or known at the time of filing, and/or as developed after the time of filing.

As used herein, the term “network” includes, but is not limited to, any network or network system such as, but not limited to, a peer-to-peer network, a hybrid peer-to-peer network, a Local Area Network (LAN), a Wide Area Network (WAN), a public network, such as the Internet, a private network, a cellular network, any general network, communications network, or general network/communications network system; a wireless network; a wired network; a wireless and wired combination network; a satellite network; a cable network; any combination of different network types; or any other system capable of allowing communication between two or more assets, virtual assets, and/or computing systems, whether available or known at the time of filing or as later developed.

As used herein, the term “user” includes, but is not limited to, any party, parties, entity, and/or entities using, or otherwise interacting with any of the methods or systems discussed herein. For instance, in various embodiments, a user can be, but is not limited to, a person, a commercial entity, an application, a service, and/or a computing system.

As used herein, the term “relationship(s)” includes, but is not limited to, a logical, mathematical, statistical, or other association between one set or group of information, data, and/or users and another set or group of information, data, and/or users, according to one embodiment. The logical, mathematical, statistical, or other association (i.e., relationship) between the sets or groups can have various ratios or correlation, such as, but not limited to, one-to-one, multiple-to-one, one-to-multiple, multiple-to-multiple, and the like, according to one embodiment. As a non-limiting example, if the disclosed electronic document preparation system determines a relationship between a first group of data and a second group of data, then a characteristic or subset of a first group of data can be related to, associated with, and/or correspond to one or more characteristics or subsets of the second group of data, or vice-versa, according to one embodiment. Therefore, relationships may represent one or more subsets of the second group of data that are associated with one or more subsets of the first group of data, according to one embodiment. In one embodiment, the relationship between two sets or groups of data includes, but is not limited to similarities, differences, and correlations between the sets or groups of data.

HARDWARE ARCHITECTURE

FIG. 1 illustrates a block diagram of a production environment 100 for learning and incorporating new forms in an electronic document preparation system, according to one embodiment. Embodiments of the present disclosure provide methods and systems for learning and incorporating new forms in an electronic document preparation system, according to one embodiment. In particular, embodiments of the present disclosure receive form data related to a new form having data fields to be completed according to functions set forth in the new form and utilize machine learning in order to correctly learn the functions for each data field and incorporate them into the electronic document preparation system. Embodiments of the present disclosure gather training set data including previously filled forms related to the new form. Embodiments of the present disclosure generate, for each data field to be learned, dependency data that indicates one or more possible dependencies likely to be included in a correct function for the data field. Embodiments of the present disclosure utilize machine learning systems and processes to generate a plurality of candidate functions for each data field to be learned. The candidate functions are based on the one or more possible dependencies and can include one or more operators selected from a set of operators. The operators can operate on one or more of the possible dependencies. Embodiments of the present disclosure generate test data for each candidate function by applying the candidate function to the training set data. Embodiments of the present disclosure compare the test data to the data values in the corresponding fields of the previously filled forms of the training set data. Embodiments of the present disclosure generate matching data indicating how closely the test data matches the values in the previously filled forms of the training set data. 
The machine learning processes can continue generating candidate functions and test data until a candidate function is found that provides test data that matches the completed fields of the training set data within a selected error tolerance percentage. Embodiments of the present disclosure can generate results data that indicates the correct functions for each data field of the new form. Embodiments of the present disclosure can output the results data for review by experts who can review and approve the correct functions. Additionally, or alternatively, embodiments of the present disclosure can determine when a correct candidate function has been found or when the new form has been entirely learned and can incorporate the new form into a user document preparation engine so that users or customers of the electronic document preparation system can utilize the electronic document preparation system to electronically prepare documents using the new form. By utilizing machine learning to learn and incorporate new forms, efficiency of the electronic document preparation system is increased.

In addition, the disclosed method and system for learning and incorporating new forms in an electronic document preparation system provides for significant improvements to the technical fields of electronic financial document preparation, data processing, data management, and user experience.

In addition, as discussed above, the disclosed method and system for learning and incorporating new forms in an electronic document preparation system provide for the processing and storing of smaller amounts of data, i.e., they more efficiently acquire and analyze forms and data, thereby eliminating unnecessary data analysis and storage. Consequently, using the disclosed method and system for learning and incorporating new forms in an electronic document preparation system results in more efficient use of human and non-human resources, fewer processor cycles being utilized, reduced memory utilization, and less communications bandwidth being utilized to relay data to, and from, backend systems and client systems, and various investigative systems and parties. As a result, computing systems are transformed into faster, more efficient, and more effective computing systems by implementing the method and system for learning and incorporating new forms in an electronic document preparation system.

The production environment 100 includes a service provider computing environment 110, user computing environment 140, third party computing environments 150, and public information computing environments 160, for learning and incorporating new forms in an electronic document preparation system, according to one embodiment. The computing environments 110, 140, 150, and 160 are communicatively coupled to each other with one or more communication channels 101, according to one embodiment.

The service provider computing environment 110 represents one or more computing systems such as a server, a computing cabinet, and/or distribution center that is configured to receive, execute, and host one or more electronic document preparation systems (e.g., applications) for access by one or more users, for learning and incorporating new forms in an electronic document preparation system, according to one embodiment. The service provider computing environment 110 represents a traditional data center computing environment, a virtual asset computing environment (e.g., a cloud computing environment), or a hybrid between a traditional data center computing environment and a virtual asset computing environment, according to one embodiment.

The service provider computing environment 110 includes an electronic document preparation system 111, which is configured to provide electronic document preparation services to a user.

According to one embodiment, the electronic document preparation system 111 can be a system that assists in preparing financial documents related to one or more of tax return preparation, invoicing, payroll management, billing, banking, investments, loans, credit cards, real estate investments, retirement planning, bill pay, and budgeting. The electronic document preparation system 111 can be a tax return preparation system or other type of electronic document preparation system. The electronic document preparation system 111 can be a standalone system that provides financial document preparation services to users. Alternatively, the electronic document preparation system 111 can be integrated into other software or service products provided by a service provider.

The electronic document preparation system 111 assists users in preparing documents related to one or more forms that include data fields to be completed by the user. The data fields request data entries in accordance with specified functions. Once the electronic document preparation system has learned the functions that produce the requested data entries for the data fields, the electronic document preparation system can assist individual users in electronically completing the form.

In many situations, such as in tax return preparation situations, state and federal governments or other financial institutions issue new or updated versions of standardized forms each year or even several times within a single year. Each time a new form is released, the electronic document preparation system 111 may need to learn the specific functions that provide the requested data entries for each data field in the new form. If these data fields are not correctly completed, there can be serious financial consequences for users. Furthermore, if the electronic document preparation system 111 does not quickly learn and incorporate new forms into the electronic document preparation system 111, users of the electronic document preparation system 111 may turn to other forms of financial document preparation services. In traditional electronic document preparation systems, new forms are learned and incorporated by financial professionals and/or experts manually reviewing the new forms and manually revising software instructions to incorporate the new forms. In some cases, this can be a slow, expensive, and unreliable system. Thus, the electronic document preparation system 111 in accordance with principles of the present disclosure advantageously utilizes machine learning in conjunction with training set data in order to quickly and efficiently learn the functions related to each data field of a form and incorporate them into the electronic document preparation system 111.

According to one embodiment, the electronic document preparation system 111 receives form data related to a new or updated version of a form. The electronic document preparation system 111 analyzes the form data and identifies data fields of the form. The electronic document preparation system 111 acquires training set data that is related to the new or updated version of the form. The training set data can include historical data related to previously prepared documents including copies of the form, or a related form, with completed data fields. The previously prepared documents can include previously prepared documents that have already been filed and approved with government or other institutions, or that were otherwise validated or approved. Additionally, or alternatively, the training set data can include fabricated data that includes previously prepared documents using fictitious data or real data that has been scrubbed of personal identifiers or otherwise altered. The electronic document preparation system 111 utilizes machine learning in combination with the training set data to learn the functions that provide the requested data entries for the data fields of the new form.

In one embodiment, the electronic document preparation system 111 can identify one or more possible dependencies for each data field to be learned. These possible dependencies can include one or more data values from other data fields of the new form, one or more data values from another related form or worksheet, one or more constants, or many other kinds of possible dependencies that can be included in a correct function for a particular data field. The electronic document preparation system 111 can identify the one or more possible dependencies based on natural language parsing of the descriptive text included in the new form and related to the data field. The electronic document preparation system can identify one or more possible dependencies by analyzing software from previous electronic document preparation systems that processed forms related to the new form. The electronic document preparation system 111 can identify possible dependencies by receiving data from an expert, from a third party, or from another source.
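
By way of a non-limiting illustration, the identification of possible dependencies from the descriptive text of a data field might be sketched as follows. The sketch substitutes a simple regular expression for full natural language parsing. The function name `extract_line_references`, the pattern, and the sample wording in the usage note are hypothetical and are not taken from any actual form.

```python
import re

# Hypothetical pattern: matches "line 4", "lines 7 and 9", "lines 7 through 9".
_LINE_REF = re.compile(
    r"\blines?\s+(\d+[a-z]?)(?:\s+(through|and)\s+(\d+[a-z]?))?",
    re.IGNORECASE,
)


def extract_line_references(description: str) -> list[str]:
    """Return line labels mentioned in a field description, in order of first mention."""
    found: list[str] = []
    for first, joiner, second in _LINE_REF.findall(description):
        labels = [first]
        if joiner.lower() == "through" and first.isdigit() and second.isdigit():
            # Expand a numeric range such as "7 through 9" into 7, 8, 9.
            labels = [str(n) for n in range(int(first), int(second) + 1)]
        elif second:
            labels.append(second)
        for label in labels:
            if label not in found:
                found.append(label)
    return found
```

For example, under these assumptions, the description "Add lines 7 through 9, then subtract line 12." yields the possible dependencies 7, 8, 9, and 12, which can then be supplied to the candidate function generation process.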

In one embodiment, the electronic document preparation system 111 generates, for each data field to be learned, a plurality of candidate functions based on the one or more dependencies and including one or more operators from a set or superset of operators. The electronic document preparation system 111 generates test data by applying the candidate functions to the training set data. The electronic document preparation system 111 then generates matching data that indicates how closely the test data matches the previously completed data fields of the training set data. When the electronic document preparation system 111 finds a candidate function that results in test data that matches the training set data within a selected error tolerance, the electronic document preparation system 111 can determine that the candidate function is the correct function for the particular data field of the new form.

In one embodiment, the electronic document preparation system 111 can generate and output results data for review by an expert. The results data can include candidate functions that are determined to be the correct functions for respective data fields of the new form. The electronic document preparation system 111 can request input from the expert to approve the candidate function. Additionally, or alternatively, the electronic document preparation system 111 can determine that the candidate function is correct and update the electronic document preparation system 111 without review or approval by an expert. In this way, the electronic document preparation system 111 can learn and incorporate new or revised forms.

The electronic document preparation system 111 includes a user interface module 112, a machine learning module 113, a data acquisition module 114, a natural language parsing module 115, a historical form analysis module 116, and a user document preparation engine 117, according to one embodiment.

The interface module 112 is configured to receive form data 119 related to a new form. The interface module 112 can receive the form data 119 from an expert, from a government agency, from a financial institution, or in other suitable ways. According to one embodiment, when a new form or new version of a form is released, an expert or other personnel of the electronic document preparation system 111 can upload an electronic version of the form to the interface module 112. The interface module 112 can also receive the form data in an automated manner such as by receiving automatic updates or in another way. The electronic version of the form is represented by the form data 119. The form data 119 can include a PDF document, an HTML document, an accessible PDF document, or other types of electronic document formats. The form data can include data related to the data fields, limiting values, tables, or other data related to the new form and its data fields that will be useful in the machine learning process.

The interface module 112 can also output results data 120 indicating the results of a machine learning process for particular candidate functions. The interface module 112 can also output learned form data 121 related to the finalized learned functions of the new form. An expert can obtain and review the results data 120 and the learned form data 121 from the interface module 112. Results data 120 or other test data can also be utilized by an expert and/or an automated system for other purposes. For example, results data 120 or other test data can be used by electronic document preparation systems to test software instructions of the electronic document preparation system before making functionality associated with the software instructions available to the public.

The machine learning module 113 analyzes the form data 119 in order to learn the functions for the data fields of the new form and incorporate them into the electronic document preparation system 111. The machine learning module 113 generates the results data 120 and the learned form data 121.

In one embodiment, the machine learning module 113 is able to generate and test thousands of candidate functions very rapidly in successive iterations. The machine learning module 113 can utilize one or more algorithms to generate candidate functions based on many factors. The machine learning module 113 can generate new candidate functions based on previously tested candidate functions. The machine learning module 113 can utilize analysis of the form data and/or other data to learn the likely dependencies/components of the correct function for a particular data field and can generate candidate functions based on these likely components.

In one embodiment, the electronic document preparation system 111 uses the data acquisition module 114 to acquire training set data 122. The training set data 122 includes previously prepared documents for a large number of previous users of the electronic document preparation system 111 or fictitious users of the electronic document preparation system 111. The training set data 122 can be used by the machine learning module 113 in order to learn and incorporate the new form into the electronic document preparation system 111.

In one embodiment, the training set data 122 can include historical data 123 related to previously prepared documents or previously filled forms of a large number of users. The historical data 123 can include, for each of a large number of previous users of the electronic document preparation system 111, a respective completed copy of the new form or a completed copy of a form related to the new form. The completed copies of the form include data values in the data fields.

In one embodiment, the training set data 122 can include fabricated data 124. The fabricated data 124 can include copies of the new form that were previously filled using fabricated data. The fabricated data can include real data from previous users or other people but that has been scrubbed of personal identifiers or otherwise altered.

In one embodiment, the historical data 123 and/or the fabricated data 124 also includes all of the related data used to complete the forms and to prepare the historical document. The historical data 123 can include previously prepared documents that include or use the completed form and which were filed with and/or approved by a government or other institution. In this way, the historical data 123 can be assured in large part to be accurate and properly prepared, though some of the data related to the previously prepared documents will inevitably include errors. Typically, the functions for computing or obtaining the proper data entry for a data field of a form can include data values from other forms and resources that are related to each other in sometimes complex ways. Thus, the historical data 123 can include, for each historical user in the training set data, a final version of a previously prepared document, the form that is related to the new form to be learned, other forms used to calculate the values for the related form, and other sources of data for completing the related form.

In one embodiment, the electronic document preparation system 111 is a financial document preparation system. In this case, the historical data 123 can include historical financial data. The historical financial data can include, for each historical user of the electronic document preparation system 111, information, such as, but not limited to, a name of the user, a name of the user's employer, an employer identification number (EID), a job title, annual income, salary and wages, bonuses, a Social Security number, a government identification, a driver's license number, a date of birth, an address, a zip code, home ownership status, marital status, W-2 income, an employer's address, spousal information, children's information, asset information, medical history, occupation, information regarding dependents, salary and wages, interest income, dividend income, business income, farm income, capital gain income, pension income, IRA distributions, education expenses, health savings account deductions, moving expenses, IRA deductions, student loan interest, tuition and fees, medical and dental expenses, state and local taxes, real estate taxes, personal property tax, mortgage interest, charitable contributions, casualty and theft losses, unreimbursed employee expenses, alternative minimum tax, foreign tax credit, education tax credits, retirement savings contribution, child tax credits, residential energy credits, and any other information that is currently used, that can be used, or that may be used in the future, in a financial document preparation system or in the preparation of financial documents such as a user's tax return, according to various embodiments.

In one embodiment, the data acquisition module 114 is configured to obtain or retrieve historical data 123 from a large number of sources. The data acquisition module 114 can retrieve, from databases of the electronic document preparation system 111, historical data 123 that has been previously obtained by the electronic document preparation system 111 from a plurality of third-party institutions. Additionally, or alternatively, the data acquisition module 114 can retrieve the historical data 123 afresh from the third-party institutions.

In one embodiment, the data acquisition module 114 can also supply or supplement the historical data 123 by gathering pertinent data from other sources including the third party computing environment 150, the public information computing environment 160, the additional service provider systems 135, data provided from historical users, data collected from user devices or accounts of the electronic document preparation system 111, social media accounts, and/or various other sources to merge with or supplement historical data 123, according to one embodiment.

The data acquisition module 114 can gather additional data including historical financial data and third party data. For example, the data acquisition module 114 is configured to communicate with additional service provider systems 135, e.g., a tax return preparation system, a payroll management system, or other electronic document preparation system, to access financial data 136, according to one embodiment. The data acquisition module 114 imports relevant portions of the financial data 136 into the electronic document preparation system 111 and, for example, saves local copies into one or more databases, according to one embodiment.

In one embodiment, the additional service provider systems 135 include a personal electronic document preparation system, and the data acquisition module 114 is configured to acquire financial data 136 for use by the electronic document preparation system 111 in learning and incorporating the new or updated form into the electronic document preparation system 111. Because the service provider provides both the electronic document preparation system 111 and, for example, the additional service provider systems 135, the service provider computing environment 110 can be configured to share financial information between the various systems. By interfacing with the additional service provider systems 135, the data acquisition module 114 can supply or supplement the historical data 123 from the financial data 136. The financial data 136 can include income data, investment data, property ownership data, retirement account data, age data, data regarding additional sources of income, marital status, number and ages of children or other dependents, geographic location, and other data that indicates personal and financial characteristics of users of other financial systems, according to one embodiment.

The data acquisition module 114 is configured to acquire additional information from various sources to merge with or supplement the training set data 122, according to one embodiment. For example, the data acquisition module 114 is configured to gather from various sources historical data 123. For example, the data acquisition module 114 is configured to communicate with additional service provider systems 135, e.g., a tax return preparation system, a payroll management system, or other financial management system, to access financial data 136, according to one embodiment. The data acquisition module 114 imports relevant portions of the financial data 136 into the training set data 122 and, for example, saves local copies into one or more databases, according to one embodiment.

The data acquisition module 114 is configured to acquire additional financial data from the public information computing environment 160, according to one embodiment. The training set data can be gathered from public record searches of tax records, public information databases, property ownership records, and other public sources of information. The data acquisition module 114 can also acquire data from sources such as social media websites, such as Twitter, Facebook, LinkedIn, and the like.

The data acquisition module 114 is configured to acquire data from third parties, according to one embodiment. For example, the data acquisition module 114 requests and receives third party data from the third party computing environment 150 to supply or supplement the training set data 122, according to one embodiment. In one embodiment, the third party computing environment 150 is configured to automatically transmit financial data to the electronic document preparation system 111 (e.g., to the data acquisition module 114), to be merged into training set data 122. The third party computing environment 150 can include, but is not limited to, financial service providers, state institutions, federal institutions, private employers, financial institutions, social media, and any other business, organization, or association that has maintained financial data, that currently maintains financial data, or which may in the future maintain financial data, according to one embodiment.

In one embodiment, the electronic document preparation system 111 utilizes the machine learning module 113 to learn the data fields of the new form in conjunction with training set data 122. The machine learning module 113 generates a plurality of candidate functions for each data field of the new form to be learned and applies the candidate functions to the training set data 122 in order to find a candidate function that produces data values that match the corresponding data values in the completed data fields of the training set data 122. The machine learning module 113 can continue to generate new candidate functions until the machine learning module 113 finds a candidate function that, when applied to the training set data 122, produces data values that match the data values in the completed data fields of the training set data.

In one embodiment, the electronic document preparation system 111 identifies dependency data 129 including one or more possible dependencies for each data field to be learned. These possible dependencies can include one or more data values from other data fields of the new form, one or more data values from another related form or worksheet, one or more constants, or many other kinds of possible dependencies that can be included in a correct function for a particular data field.

In one embodiment, the machine learning module 113 generates candidate functions based on the dependency data 129 and one or more operators selected from a set or superset of operators. The operators can include arithmetic operators such as addition, subtraction, multiplication, or division operators. The operators can include logical operators such as if-then operators. The operators can include existence condition operators that depend on the existence of a data value in another data field of the new form, in a form other than the new form, or in some other location or data set. The operators can include string comparisons. Each candidate function can include one or more of the operators operating on one or more of the possible dependencies.
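
By way of a non-limiting illustration, the generation of candidate functions from possible dependencies and operators might be sketched as follows. The sketch is limited to the four arithmetic operators applied to pairs of dependencies; logical, existence condition, and string comparison operators are omitted for brevity. The names `Candidate`, `OPERATORS`, and `generate_candidates`, and the representation of a form copy as a dictionary of field values, are hypothetical.

```python
import itertools
import operator
from typing import Callable, NamedTuple


class Candidate(NamedTuple):
    label: str                      # human-readable form, e.g. "line_1 - line_2"
    func: Callable[[dict], float]   # maps a record of field values to a result


def _safe_div(a: float, b: float) -> float:
    # Division by zero yields NaN, which never matches a completed value.
    return a / b if b else float("nan")


# Each entry: symbol -> (implementation, is_commutative). Illustrative subset only.
OPERATORS = {
    "+": (operator.add, True),
    "-": (operator.sub, False),
    "*": (operator.mul, True),
    "/": (_safe_div, False),
}


def _bind(op: Callable, left: str, right: str) -> Callable[[dict], float]:
    def func(record: dict) -> float:
        return op(record[left], record[right])
    return func


def generate_candidates(dependencies: list[str]) -> list[Candidate]:
    """Enumerate every two-operand candidate function over the given dependencies."""
    candidates = []
    for left, right in itertools.permutations(dependencies, 2):
        for symbol, (op, commutative) in OPERATORS.items():
            if commutative and left > right:
                continue  # "a + b" and "b + a" are the same function; keep one
            candidates.append(Candidate(f"{left} {symbol} {right}", _bind(op, left, right)))
    return candidates
```

Under these assumptions, three possible dependencies produce eighteen candidate functions. Deeper expressions could be produced by treating the output of one candidate function as a dependency of another, at the cost of a rapidly growing search space.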

In one embodiment, the machine learning module 113 learns the correct function for the data fields one at a time. In other words, if the form data 119 indicates that a form has 10 data fields to be learned, the machine learning module 113 will begin by learning the correct function for a first data field of the new form. In particular, the machine learning module 113 will generate candidate function data 125 corresponding to a plurality of candidate functions for the first data field of the new form as represented by the form data 119. The machine learning module 113 also receives training set data 122 from the data acquisition module 114. The training set data 122 includes data related to previously completed copies of the form to be learned or previously completed copies of a form closely related to the new form to be learned. In particular, the training set data 122 includes copies of the form that have a data entry in the data field that corresponds to the data field of the new form currently being analyzed and learned by the machine learning module 113. The training set data 122 also includes data that was used to calculate the data values in the data field for each copy of the form or for each copy of the related form, e.g. W-2 data, income data, data related to other forms such as tax forms, payroll data, personal information, or any other kind of information that was used to complete the copies of the form or the copies of the related form in the training set data 122. The machine learning module 113 generates test data 126 by applying each of the candidate functions to the training set data for the particular data field currently being learned. In particular, for each copy of the form or related form in the training set data 122, the machine learning module 113 applies the candidate function to the training set data related to that copy of the form in order to generate a test data value for the data field. 
Thus, if the training set data 122 includes 1000 completed copies of the new form or a related form, then the machine learning module 113 will generate test data 126 that includes one test data value for the particular data field being analyzed for each of the thousand completed copies. In one embodiment, the machine learning module 113 then generates matching data 127 by comparing the test data value for each copy of the form to the actual data value from the completed data field of that copy of the form. The matching data 127 indicates how many of the test data values match their corresponding completed data value from the training set data 122. If the candidate function is correct, then the test data values will match the completed data values for nearly every copy of the form or related form in the training set data 122.
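
By way of a non-limiting illustration, the generation of test data and matching data for a single candidate function might be sketched as follows. The record layout, in which each completed copy has an `inputs` dictionary and a `completed` dictionary, the function name `build_matching_data`, and the half-cent comparison tolerance are assumptions made for the sketch.

```python
import math
from typing import Callable


def build_matching_data(
    candidate: Callable[[dict], float],
    training_records: list[dict],
    target_field: str,
    abs_tol: float = 0.005,
) -> dict:
    """Apply one candidate function to every completed copy and compare the results."""
    if not training_records:
        raise ValueError("training_records must not be empty")
    per_copy = []
    for record in training_records:
        test_value = candidate(record["inputs"])              # test data value
        completed_value = record["completed"][target_field]  # value actually filed
        per_copy.append(
            math.isclose(test_value, completed_value, rel_tol=0.0, abs_tol=abs_tol)
        )
    matched = sum(per_copy)
    return {
        "matched": matched,
        "total": len(per_copy),
        "fraction": matched / len(per_copy),
        "per_copy": per_copy,
    }
```

The `per_copy` list retains which copies disagreed with the candidate function, which can be useful to an expert reviewing whether the disagreements stem from an incorrect candidate function or from errors in the training set data.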

It is expected that the training set data 122 may include some errors in the completed data values for the data field under test. Thus, the correct function may result in test data 126 that does not perfectly match the completed data fields in the training set data 122. Thus, the correct candidate function will result in test data that matches the training set data within an error tolerance. In one embodiment, the machine learning module 113 will continue to generate and test candidate functions until a candidate function has been found that results in test data that matches the training set data 122 within the error tolerance. When the correct function has been found for the first data field of the new form, the machine learning module 113 can repeat this process for the second data field of the new form to be learned. The machine learning module 113 can continue in this manner until the correct function for each data field of the new form has been found.
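
By way of a non-limiting illustration, the field-by-field search with an error tolerance might be sketched as follows. The function name `learn_form`, the record layout, and the representation of candidate functions as labeled callables grouped by data field are hypothetical. A data field for which no candidate function falls within the tolerance is simply left out of the result in this sketch.

```python
from typing import Callable


def learn_form(
    fields: list[str],
    candidates_by_field: dict[str, list[tuple[str, Callable[[dict], float]]]],
    training_records: list[dict],
    tolerance: float = 0.05,
) -> dict[str, str]:
    """Return, per data field, the label of the first candidate within the tolerance."""
    learned: dict[str, str] = {}
    total = len(training_records)
    for field in fields:  # data fields are learned one at a time
        for label, func in candidates_by_field[field]:
            hits = sum(
                1
                for rec in training_records
                if func(rec["inputs"]) == rec["completed"][field]
            )
            # Accept when the fraction of mismatched copies is within the tolerance.
            if total and hits / total >= 1.0 - tolerance:
                learned[field] = label
                break  # correct function found; move on to the next data field
    return learned
```

With a tolerance of zero, a single erroneous copy in the training set data would cause the correct function to be rejected, which is why a nonzero tolerance is selected when the training set data is expected to contain errors.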

In one embodiment, the machine learning module 113 generates and tests candidate functions one at a time. Each time the matching data 127 for a candidate function does not indicate that the candidate function is correct, the machine learning module 113 generates a new candidate function and tests the new candidate function. The machine learning module 113 can continue this process until the correct candidate function has been found. In this way, the machine learning module 113 generates a plurality of candidate functions sequentially for each data field under test.

In one embodiment, the machine learning module 113 can first generate a plurality of candidate functions and then test each of the candidate functions. If the matching data 127 indicates that none of the candidate functions is the correct candidate function, then the machine learning module 113 can generate a second plurality of candidate functions and apply them to the training set data 122. The machine learning module 113 can continue generating candidate functions and applying them to the training set data until the correct function has been found.

In one embodiment, the machine learning module 113 generates candidate functions in successive iterations based on one or more algorithms. The successive iterations can be based on whether the matching data indicates that the candidate functions are becoming more accurate. The machine learning module 113 can continue to make adjustments to the candidate functions in directions that make the matching data more accurate until the correct function has been found.
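
By way of a non-limiting illustration, one possible form of successive iteration is a greedy search that adjusts a candidate function one step at a time and keeps only adjustments that improve accuracy. The sketch below is restricted to candidate functions that are sums of dependencies, and it uses mean absolute error as the accuracy signal, because an exact match count gives no indication of improvement until the complete function is found. The algorithm, the error measure, and the names `mean_abs_error` and `greedy_sum_search` are assumptions; the disclosure does not limit the machine learning module 113 to this approach.

```python
def mean_abs_error(terms: list[str], records: list[dict], target: str) -> float:
    """Average absolute gap between the sum of `terms` and the completed value."""
    total = 0.0
    for rec in records:
        predicted = sum(rec["inputs"][t] for t in terms)
        total += abs(predicted - rec["completed"][target])
    return total / len(records)


def greedy_sum_search(
    dependencies: list[str], records: list[dict], target: str
) -> tuple[list[str], float]:
    """Add one dependency per iteration while doing so lowers the error."""
    chosen: list[str] = []
    best_error = mean_abs_error(chosen, records, target)
    while True:
        best_dep = None
        for dep in dependencies:
            if dep in chosen:
                continue
            error = mean_abs_error(chosen + [dep], records, target)
            if error < best_error:
                best_error, best_dep = error, dep
        if best_dep is None:
            return chosen, best_error  # no adjustment improves the candidate
        chosen.append(best_dep)
```

A greedy search of this kind can stall on a candidate function that is not correct, so its output would still be checked against the error tolerance before being treated as the correct function.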

In one embodiment, the machine learning module 113 generates confidence score data 128 based on the matching data 127. The confidence score data 128 can indicate, for each candidate function, how confident the machine learning module 113 is that the candidate function is the correct function. The confidence score data 128 can be based on the matching data 127 and recurrence data.

In one embodiment, the machine learning module 113 generates results data 120. The results data 120 can include matching data 127 and/or confidence score data 128 for each candidate function that has been tested for a particular data field of the new form to be learned. Alternatively, the results data 120 can include data indicating that one or more of the candidate functions is possibly correct based on the matching data 127 and/or the confidence score data 128. Alternatively, the results data 120 can indicate that the correct function has been found. The results data 120 can also indicate what the correct function is. The results data 120 can be provided to the interface module 112. The interface module 112 can output the results data 120 to an expert or other personnel for review and/or approval.

In one embodiment, the machine learning module 113 outputs results data 120 indicating that a candidate function has been found that is likely correct. The results data 120 can indicate what the candidate function is, the matching data 127 or confidence score data 128 related to the candidate function, or any other information that will be useful for review by an expert. The machine learning module 113 can cause the interface module 112 to prompt an expert to review the results data 120 and to approve the candidate function as correct or to indicate that the candidate function is not correct and that the machine learning module 113 should continue generating candidate functions for the data field currently under test. The machine learning module 113 awaits input from the expert or other personnel approving the candidate function. If the candidate function is approved by the expert or other personnel, the machine learning module 113 determines that the correct function has been found and moves on to finding the correct function for the next data field of the new form.

In one embodiment, the machine learning module 113 does not wait for the approval of an expert before determining that the correct candidate function has been found. Instead, when the machine learning module 113 determines that the correct function has been found based on the matching data 127, the confidence score data 128, and/or other criteria, the machine learning module 113 moves on to the next data field of the new form under test.

In one embodiment, when the machine learning module 113 has learned the correct function for each data field of the new form, then the machine learning module 113 generates learned form data 121. The learned form data 121 indicates that the new form has been learned. The learned form data 121 can also indicate what the correct functions are for each of the data fields of the new form. The interface module 112 can output the learned form data 121 for review and/or approval by an expert. In one embodiment, once the expert or other personnel has approved the learned form data 121, the machine learning module 113 ceases analysis of the new form and awaits form data 119 related to another form to be learned.

In one embodiment, the electronic document preparation system 111 includes a user document preparation engine 117. The user document preparation engine 117 is the engine that assists users of the electronic document preparation system 111 to prepare a financial document based on or including the newly learned form as well as other forms. The user document preparation engine 117 includes current document instructions data 131. The current document instructions data 131 includes software instructions, modules, engines, or other data or processes used to assist users of the electronic document preparation system 111 in electronically preparing a document.

In one embodiment, once the machine learning module 113 has fully learned the correct functions for the data fields of a new form, the machine learning module 113 incorporates the newly learned form into the electronic document preparation system 111 by updating the current document instructions data 131. When the current document instructions data 131 has been updated to include and recognize the new form, then users of the electronic document preparation system 111 can electronically complete the new form using the electronic document preparation system 111. In this way, the electronic document preparation system 111 quickly provides functionality that electronically completes the data fields of the new form as part of preparing a financial document.

In one embodiment, the user computing environment 140 is a computing environment related to a user of the electronic document preparation system 111. The user computing environment 140 includes input devices 141 and output devices 142 for communicating with the user, according to one embodiment. The input devices 141 include, but are not limited to, keyboards, mice, microphones, touchpads, touchscreens, digital pens, and the like. The output devices 142 include, but are not limited to, speakers, monitors, touchscreens, and the like. The output devices 142 can display data related to the preparation of the financial document.

In one embodiment, the machine learning module 113 can also generate interview content to assist in a financial document preparation interview. As a user utilizes the electronic document preparation system 111 to prepare a financial document, the user document preparation engine 117 may guide the user through a financial document preparation interview in order to assist the user in preparing the financial document. The interview content can include graphics, prompts, text, sound, or other electronic, visual, or audio content that assists the user to prepare the financial document. The interview content can prompt the user to provide data, to select relevant forms to be completed as part of the financial document preparation process, to explore financial topics, or otherwise assist the user in preparing the financial document. When the machine learning module 113 learns the correct function for each data field of a form, the machine learning module 113 can also generate text or other types of audio or video prompts that describe the function and that can prompt the user to provide information that the user document preparation engine 117 will use to complete the form. Thus, the machine learning module 113 can generate interview content to assist in a financial document preparation interview.

In one embodiment, the machine learning module 113 updates the current document instructions data 131 once a new form has been entirely learned without input or approval of an expert or other personnel. In one embodiment, the machine learning module 113 updates the current document instructions data 131 only after an expert has given approval that the new form has been properly learned.

In one embodiment, the machine learning module 113 only learns the candidate function for selected fields of a new form. For example, the machine learning module 113 may be configured to perform machine learning processes to learn the correct functions for certain types of data fields. Some types of data fields may not be as conducive to machine learning processes or for other reasons the machine learning module 113 may be configured to learn only particular data fields of a new form. In these cases, the machine learning module 113 will only learn certain selected data fields of the new form. In some cases, the machine learning module 113 may determine that it is unable to learn the correct function for one or more data fields after generating and testing many candidate functions for the one or more data fields. The results data 120 can therefore include data indicating that the correct function for a particular data field of the new form cannot be learned by the machine learning module 113.

In one embodiment, once the form data 119 has been provided to the electronic document preparation system 111, the expert or other personnel can input an indication of which data fields of the new form should be learned by the machine learning module 113. The machine learning module 113 will then only learn the correct functions for those fields of the new form that have been indicated by the expert or other personnel. In one embodiment, the form data 119 can indicate which data fields the machine learning module 113 should learn. In this way, the machine learning module 113 only attempts to learn selected data fields of a new form.

In one embodiment, the correct function for a data field may be simple or complex. A complex function may require that multiple data values be gathered from multiple places within other forms, the same form, from a user, or in other locations. A complex function may also include mathematical relationships that will be applied to the multiple data values in complex ways in order to generate the proper data value for the data field. A function may include finding the minimum data value among two or more data values, finding the maximum data value among two or more data values, addition, subtraction, multiplication, division, exponential functions, logic functions, existence conditions, string comparisons, etc. The machine learning module 113 can generate and test complex candidate functions until the correct function has been found for a particular data field.
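A minimal sketch of how candidate functions might be assembled from a set of operators is shown below. The `OPERATORS` table, the `make_candidate` helper, and the field names such as `line_1` are hypothetical names introduced for illustration; only a few of the operators listed above are included.

```python
import operator

# A small, assumed subset of the operator set described above.
OPERATORS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "min": min,
    "max": max,
}


def make_candidate(op_name, left, right):
    """Build a candidate function over a dict of field values.

    `left` and `right` are either field names (str) or numeric constants.
    """
    op = OPERATORS[op_name]

    def resolve(operand, values):
        # Field names are looked up in the form; constants are used as-is.
        return values[operand] if isinstance(operand, str) else operand

    def candidate(values):
        return op(resolve(left, values), resolve(right, values))

    candidate.description = f"{op_name}({left!r}, {right!r})"
    return candidate
```

For example, `make_candidate("add", "line_1", "line_2")` returns a function that sums two fields of a form represented as a dictionary. More complex functions would require nesting such candidates, which this sketch does not attempt.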

In one embodiment, new forms may include data fields that expect data values that are alphabetical such as a first name, a last name, a middle name, a middle initial, a company name, a name of a spouse, a name of a child, a name of a dependent, a home address, a business address, a state of residence, the country of citizenship, or other types of data values that are generally alphabetic. In these cases, the correct function may include a first name, a last name, a middle name, a middle initial, a company name, a name of a spouse, a name of a child, a name of a dependent, a home address, a business address, a state of residence, the country of citizenship, or other types of alphabetic data values as the case may be. The correct function can also include a location from which these alphabetic data values may be retrieved in other forms, worksheets, or financial related data otherwise provided by users or gathered from various sources. The forms may also include data fields that expect data values that are numeric by nature. These data values may include incomes, tax withholdings, Social Security numbers, identification numbers, ages, loan payments, interest payments, charitable contributions, mortgage payments, dates, or other types of data values that are typically numeric in nature.

In one embodiment, the machine learning module 113 can generate candidate functions for a particular data field by referring to the dependency data that can provide an indication of the types of data that are likely to be included in the correct function and their likely location in other forms or data. For example, the machine learning module 113 can utilize historical document instructions data 130, natural language parsing data 132, current document instructions data 131, and other types of contextual clues or hints in order to find a likely starting place for generating candidate functions. For this reason, the electronic document preparation system 111 can include a natural language parsing module 115 and a historical form analysis module 116.

In one embodiment, the natural language parsing module 115 analyzes the form data 119 with a natural language parsing process. In particular, the natural language parsing module 115 analyzes the text description associated with each data field of the new form. For example, the form data 119 may include text descriptions for the various data fields of the new form. The natural language parsing module 115 analyzes these text descriptions and generates natural language parsing data 132 indicating the type of data value expected in each data field based on the text description. The natural language parsing module 115 provides the natural language parsing data 132 to the machine learning module 113. The machine learning module 113 generates candidate functions for the various data fields based on the natural language parsing data 132. In this way, the machine learning module 113 utilizes the natural language parsing data 132 to assist in the machine learning process.
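As a rough illustration of how a text description could yield hints for candidate generation, the sketch below uses simple keyword and pattern matching. This is a deliberately simplistic stand-in, not the natural language parsing process of the disclosure; the `OPERATOR_HINTS` table, the `parse_description` function, and the example descriptions are all assumptions.

```python
import re

# Assumed mapping from words in a line description to likely operators.
OPERATOR_HINTS = {
    "add": "add", "sum": "add", "total": "add",
    "subtract": "sub",
    "multiply": "mul",
    "smaller": "min", "lesser": "min",
    "larger": "max", "greater": "max",
}


def parse_description(text):
    """Extract referenced line identifiers and operator hints from a description."""
    lowered = text.lower()
    # Matches references such as "line 3" or "line 2c".
    referenced = re.findall(r"\bline\s+(\d+[a-z]?)\b", lowered)
    hints = []
    for word in re.findall(r"[a-z]+", lowered):
        hint = OPERATOR_HINTS.get(word)
        if hint is not None and hint not in hints:
            hints.append(hint)
    return {"referenced_lines": referenced, "operator_hints": hints}
```

Under these assumptions, a description such as "Enter the smaller of line 3 or line 4" would suggest trying a minimum over lines 3 and 4 before other candidate functions.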

In one embodiment, the historical form analysis module 116 analyzes the form data 119 in order to determine if it is likely that previous versions of the electronic document preparation system 111 included software instructions that computed data values for data fields of historical forms that are similar to the new form. Accordingly, the historical form analysis module 116 analyzes the historical document instructions data 130 that includes software instructions from previous versions of the electronic document preparation system 111. Because it is possible that the previous versions of the electronic document preparation system utilized software languages or structures that are now obsolete, the historical document instructions data 130 cannot easily or simply be analyzed or imported into the current document instructions data 131. For this reason, the historical form analysis module 116 can analyze the historical document instructions data 130 related to historical forms that are similar to the new form. Such historical forms may include previous versions of the new form. The historical form analysis module 116 can identify from the outdated software language the correct functions related to data fields of the historical forms and can generate historical instruction analysis data that indicates the correct functions for the previous version of the form. The machine learning module 113 can utilize these instructions in order to find a starting point for generating the candidate functions in order to learn the data fields of the new form.

In some cases, a new form may be nearly identical to a previous known version of the form. In these cases, the training set data 122 can include historical data 123 that relates to previously prepared, filed, and/or approved financial documents that included or were based on the previous known form. In these cases, the data acquisition module 114 will gather training set data 122 that includes a large number of previously completed copies of the previous version of the form. The machine learning module 113 generates the candidate functions and applies them to the training set data 122 as described previously.

In some cases, a new form may include data fields that are different enough that no analogous previously prepared financial documents are available to assist in the machine learning process. In one embodiment, the data acquisition module 114 gathers training set data 122 that includes fabricated financial data 124. The fabricated financial data 124 can include copies of the new form prepared with fabricated financial data by a third-party organization or a processor system associated with the service provider computing environment 110. The fabricated financial data 124 can be used by the machine learning module 113 in the machine learning process for learning the correct functions associated with the data fields of the new form. In such a case the machine learning module 113 generates candidate functions and applies them to the training set data 122 including the fabricated financial data 124 as described previously.

In one embodiment, the training set data 122 can include both historical data 123 and fabricated financial data 124. In some cases, the historical data 123 can include previously prepared documents as well as previously fabricated financial documents based on fictitious or real financial data.

In one embodiment, the data acquisition module 114 gathers new training set data 122 each time a new data field of the new form is to be analyzed by the machine learning module 113. The data acquisition module 114 can gather a large training set data 122 including many thousands or millions of previously prepared or previously fabricated financial documents. When a new data field of a new form is to be learned by the machine learning module 113, the data acquisition module 114 will gather training set data 122, or a subset of the training set data 122, that includes a selected number of previously prepared financial documents that each have a data value in a data field of a form that corresponds to the data field of the new form that is currently being learned by the machine learning module 113. In some cases, the training set data 122 can include millions of previously prepared financial documents, but only a few hundred or a few thousand of the previously prepared documents are needed for analysis by the machine learning module 113. Thus, the data acquisition module 114 can gather training set data 122 that is appropriate and efficient for the machine learning module 113 to use in learning the current data field of the new form.
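The per-field selection of training set data described above can be sketched as follows, assuming each previously filled form is represented as a dictionary keyed by field name. The function name `select_training_subset`, the dictionary representation, and the use of random sampling are assumptions made for illustration.

```python
import random


def select_training_subset(filled_forms, field_name, sample_size, seed=0):
    """Pick up to `sample_size` forms that have a value in `field_name`.

    Forms lacking the field (or holding None) are skipped because they
    cannot be used to check a candidate function for that field.
    """
    eligible = [form for form in filled_forms if form.get(field_name) is not None]
    if len(eligible) <= sample_size:
        return eligible
    # A fixed seed keeps the sample reproducible between runs.
    return random.Random(seed).sample(eligible, sample_size)
```

Sampling a few hundred or a few thousand eligible forms, rather than scanning millions, keeps each candidate function inexpensive to test.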

In one embodiment, the electronic document preparation system 111 is a tax return preparation system. Preparing a single tax return can require many government tax forms, many internal worksheets used by the tax return preparation system in preparing a tax return, W-2 forms, and many other types of forms or financial data pertinent to the preparation of a tax return. For each tax return that is prepared for a user, the tax return preparation system maintains copies of all of the various tax forms, internal worksheets, data provided by the user, and any other relevant financial data used to prepare the tax return. Thus, the tax return preparation system maintains historical tax return data related to millions of previously prepared tax returns. The tax return preparation system can utilize the historical tax return data to gather or generate relevant training set data 122 that can be used by the machine learning module 113.

In one embodiment, a state or federal agency releases a new tax form that is simply a new version of a previous tax form during tax return preparation season. An expert uploads form data 119 to the interface module 112. The form data 119 corresponds to an electronic version of the new tax form. Many or all of the data fields of the new tax form may be similar to those of the previous tax form. The machine learning module 113 begins to learn the new tax form starting with a first selected data field of the new tax form. The first selected data field corresponds to a first selected line of the new tax form, not necessarily line 1 of the new tax form. The machine learning module 113 causes the data acquisition module 114 to gather training set data 122 that includes a large number of previously prepared tax returns and the tax related data associated with the previously prepared tax returns. In particular, the training set data 122 will include previously prepared tax returns that use the previous version of the new form. The machine learning module 113 generates a plurality of candidate functions for the first selected data field and applies them to the training set data 122. For each candidate function, the machine learning module 113 generates matching data 127 and/or confidence score data 128 indicating how well the test data 126 matches the training set data 122. The machine learning module 113 generates results data 120 indicating the matching data 127 and/or the confidence score data 128 of one or more of the candidate functions. The results data 120 can also indicate whether a candidate function is deemed to be the correct function for the first selected data field.

The machine learning module 113 moves on to a second selected data field after the correct function has been found for the first selected data field. The data fields correspond to selected lines of the new tax form. The machine learning module 113 continues in this manner until the correct functions for all selected data fields of the new tax form have been found. When all selected data fields of the new tax form have been learned, the machine learning module 113 generates learned form data 121 indicating that all selected fields of the new form have been learned. The interface module 112 can present results data 120 or learned form data 121 for review and/or approval by an expert or other personnel. Alternatively, the machine learning module 113 can move from one data field to the next data field without approval or review by an expert.

In one embodiment, the tax return preparation system receives form data 119 corresponding to a new form for which an adequate previously known form cannot be found. In this case, the data acquisition module 114 gathers training set data 122 that can include fabricated financial data 124. The fabricated financial data 124 can include fictitious previously prepared tax returns and the fabricated financial data that was used to prepare them. The data acquisition module 114 can obtain the fabricated financial data 124 from one or more third parties, one or more associated tax return preparation systems, or in any other way. For example, the tax return preparation system can generate fabricated financial data and provide it to one or more third parties to prepare a fabricated tax return using the new tax form. The fabricated financial data can include data related to real users of the tax return preparation system, stripped of actual identifiers such as real names, real Social Security numbers, etc. The third parties can then prepare tax returns from the fabricated financial data using the new form. The third parties can then provide the fabricated tax returns to the tax return preparation system. The tax return preparation system can then utilize the fabricated financial data 124 in conjunction with the machine learning module 113 to learn the correct functions for the data fields of the new form.

In one example, the tax return preparation system receives form data 119 related to a new tax form. The data acquisition module 114 gathers training set data 122 that includes historical tax return data related to previously prepared tax returns and/or fabricated historical tax return data related to fabricated tax returns using the new form. The machine learning module 113 undertakes to learn the correct function for generating the data value to be entered into line 3 of the new tax form. The machine learning module 113 refers to the dependency data that indicates that the correct function for line 3 is possibly based on the values of line 31, line 2 c, and the constants 3000 and 6000. The training set data 122 includes numerous previously completed copies of the new form or a related form each having a data value in line 3. The training set data 122 also includes all the financial tax related data that were used to prepare the real or fabricated tax returns. The machine learning module 113 generates a candidate function for line 3 of the new form. The machine learning module 113 applies the candidate function to the training set data 122. In particular, the machine learning module 113 generates test data 126 by generating test values for line 3 of each of the previously completed copies of the new or related form. The machine learning module 113 generates matching data 127 by comparing the test values to the actual completed data values from the training set data 122 for line 3. The matching data 127 indicates how well the test values match the actual values in line 3 of the previously completed forms. If the matching data 127 indicates that the test data 126 matches the training set data 122 within a selected error tolerance, then the machine learning module 113 determines that the candidate function is correct or may be correct.
After many iterations of generating and testing candidate functions, the machine learning module 113 concludes that the correct function for line 3 is as follows: if line 31 exists, then line 3 will be the same as line 31. If line 31 does not exist, then line 3 is the minimum of 6000 and the product of 3000 and the value from line 2 c.
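The line 3 example can be written out as a short sketch. The dictionary keys (`line_2c`, `line_31`, `line_3`), the helper `match_rate`, and the four sample rows are hypothetical and were made up for illustration; they are not real tax data.

```python
def line_3_candidate(form):
    """Candidate: line 31 if it exists, otherwise min(6000, 3000 * line 2 c)."""
    if form.get("line_31") is not None:
        return form["line_31"]
    return min(6000, 3000 * form["line_2c"])


def match_rate(candidate, training_forms, target="line_3", tolerance=0.0):
    """Fraction of forms whose recorded value the candidate reproduces."""
    matches = sum(
        1 for form in training_forms
        if abs(candidate(form) - form[target]) <= tolerance
    )
    return matches / len(training_forms)


# Illustrative, made-up training rows.
training_forms = [
    {"line_2c": 1, "line_31": None, "line_3": 3000},
    {"line_2c": 2, "line_31": None, "line_3": 6000},
    {"line_2c": 3, "line_31": None, "line_3": 6000},
    {"line_2c": 2, "line_31": 4500, "line_3": 4500},
]

print(match_rate(line_3_candidate, training_forms))  # prints 1.0
```

A simpler candidate such as `3000 * line_2c` reproduces only the first two rows, so its match rate on this sample is 0.5 and it would be rejected.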

In one embodiment, the machine learning module 113 can also generate confidence score data 128 indicating a level of confidence that the candidate function is correct. The machine learning module 113 generates results data 120 that indicate that the candidate function is likely the correct function. The interface module 112 outputs the results data 120 for review and/or approval by an expert or other personnel. The expert or other personnel can approve the candidate function, causing the machine learning module 113 to move to the next selected line of the new tax form. Alternatively, the machine learning module 113 can decide that the candidate function is correct without approval from an expert or other personnel and can move on to the next selected line of the new tax form. If the matching data 127 indicates that the candidate function does not match the training set data 122 well, then the machine learning module 113 generates one or more other candidate functions and generates test data 126 by applying the one or more candidate functions to the training set data 122 in the same way. The machine learning module 113 can continue to generate candidate functions in successive iterations until the correct candidate function has been found. The machine learning module 113 can continue from one line of the new tax form to the next until all selected lines of the tax form have been correctly learned by the machine learning module 113.

In one embodiment, when all selected lines of the new tax form have been learned, the machine learning module 113 generates learned form data 121 that indicates that the new tax form has been learned. The learned form data 121 can also include the correct functions for each selected line of the new tax form. The interface module 112 can output the learned form data 121 for review by an expert or other personnel.

In one embodiment, when the tax form has been learned by the machine learning module 113, the machine learning module 113 updates the current document instructions data 131 to include software instructions for completing the new tax form as part of the tax return preparation process.

While the present disclosure describes a process for finding a correct candidate function, a correct candidate function can correspond to an acceptable candidate function that is not necessarily entirely correct. Therefore, it is possible that a correct candidate function may be an acceptable candidate function even if there is not complete certainty that the candidate function is entirely correct. A correct candidate function can be a candidate function that produces matching data that is accurate within an acceptable error threshold. Thus, embodiments of the present disclosure can identify acceptable candidate functions.

Embodiments of the present disclosure address some of the shortcomings associated with traditional electronic document preparation systems that do not adequately learn and incorporate new forms into the electronic document preparation system. An electronic document preparation system in accordance with one or more embodiments provides more reliable financial management services by utilizing machine learning and training set data to learn and incorporate new forms into the electronic document preparation system. The various embodiments of the disclosure can be implemented to improve the technical fields of data processing, data collection, resource management, and user experience. Therefore, the various described embodiments of the disclosure and their associated benefits amount to significantly more than an abstract idea. In particular, by utilizing machine learning, the electronic document preparation system can more efficiently learn and incorporate new forms.

PROCESS

FIG. 2 illustrates a functional flow diagram of a process 200 for learning and incorporating new forms in an electronic document preparation system, in accordance with one embodiment.

At block 202 the interface module 112 receives form data related to a new form having a plurality of data fields that expect data values in accordance with specific functions, according to one embodiment. From block 202 the process proceeds to block 204.

At block 204 the data acquisition module 114 gathers training set data related to previously filled forms having completed data fields that each correspond to a respective data field of the new form, according to one embodiment. From block 204 the process proceeds to block 206.

At block 206 the machine learning module 113 generates candidate function data including, for each data field of the new form, a plurality of candidate functions for providing the expected data value for the data field, according to one embodiment. From block 206 the process proceeds to block 208.

At block 208 the machine learning module 113 generates test data by applying the candidate functions to the training set data, according to one embodiment. From block 208 the process proceeds to block 210.

At block 210 the machine learning module 113 generates matching data indicating how closely the test data for each candidate function matches the training set data, according to one embodiment. From block 210 the process proceeds to block 212.

At block 212, the machine learning module 113 identifies a respective correct function for each data field of the new form based on the matching data. From block 212 the process proceeds to block 214.

At block 214 the machine learning module 113 generates results data indicating the correct function for each data field of the new form, according to one embodiment. From block 214 the process proceeds to block 216.

At block 216, the interface module 112 outputs the results data for review by an expert or other personnel, according to one embodiment.

Although a particular sequence is described herein for the execution of the process 200, other sequences can also be implemented. For example, the data acquisition module 114 can gather training set data each time a new data field of the new form is to be learned. The machine learning module 113 can generate a single candidate function at a time and can generate test data and matching data for that candidate function and determine if the candidate function is correct based on the matching data. If the candidate function is not correct, the machine learning module 113 returns to block 206 and generates a new candidate function and repeats the process until the correct function has been found for the data field currently being learned. When the correct function is found for a particular data field, the data acquisition module 114 can again gather training set data for the next data field and the machine learning module 113 can generate, test, and analyze candidate functions until the correct function has been found. The machine learning module 113 can generate candidate functions based on dependency data that indicates one or more possible dependencies for the correct function for a given data field. The machine learning module 113 can generate candidate functions by selecting one or more operators from a set of operators. Other sequences can also be implemented.
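The one-candidate-at-a-time variant can be sketched as a generate-and-test loop. The exhaustive enumeration below is only one assumed search strategy; the disclosure does not limit how candidates are generated. The names `OPERATORS`, `candidates`, and `learn_field`, and the dictionary representation of forms, are hypothetical.

```python
import itertools
import operator

# Assumed operator set; the disclosure lists additional operators.
OPERATORS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "min": min,
    "max": max,
}


def candidates(dependencies):
    """Yield (description, function) pairs, one candidate at a time."""
    # Simplest candidates first: copy a single dependency.
    for name in dependencies:
        yield name, (lambda form, n=name: form[n])
    # Then every operator applied to every ordered pair of dependencies.
    pairs = itertools.permutations(dependencies, 2)
    for (op_name, op), (a, b) in itertools.product(OPERATORS.items(), pairs):
        yield f"{op_name}({a}, {b})", (lambda form, op=op, a=a, b=b: op(form[a], form[b]))


def learn_field(target, dependencies, training_forms, min_match_rate=1.0):
    """Return the first acceptable candidate's description, or None."""
    for description, fn in candidates(dependencies):
        # Test data: the value the candidate would put in the target field.
        test_values = [fn(form) for form in training_forms]
        # Matching data: how many test values equal the recorded values.
        matches = sum(t == form[target] for t, form in zip(test_values, training_forms))
        if matches / len(training_forms) >= min_match_rate:
            return description
    return None  # the field could not be learned from these candidates
```

Returning `None` corresponds to the case in which the results data indicate that the correct function for a data field cannot be learned.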

FIG. 3 illustrates a flow diagram of a process 300 for learning and incorporating new forms in an electronic document preparation system, according to various embodiments.

In one embodiment, process 300 for learning and incorporating new forms in an electronic document preparation system begins at BEGIN 302 and process flow proceeds to RECEIVE FORM DATA RELATED TO A NEW FORM HAVING A PLURALITY OF DATA FIELDS 304.

In one embodiment, at RECEIVE FORM DATA RELATED TO A NEW FORM HAVING A PLURALITY OF DATA FIELDS 304 process 300 for learning and incorporating new forms in an electronic document preparation system receives form data related to a new form having a plurality of data fields.

In one embodiment, once process 300 for learning and incorporating new forms in an electronic document preparation system receives form data related to a new form having a plurality of data fields at RECEIVE FORM DATA RELATED TO A NEW FORM HAVING A PLURALITY OF DATA FIELDS 304 process flow proceeds to GATHER TRAINING SET DATA RELATED TO PREVIOUSLY FILLED FORMS, EACH PREVIOUSLY FILLED FORM HAVING COMPLETED DATA FIELDS THAT EACH CORRESPOND TO A RESPECTIVE DATA FIELD OF THE NEW FORM 306.

In one embodiment, at GATHER TRAINING SET DATA RELATED TO PREVIOUSLY FILLED FORMS, EACH PREVIOUSLY FILLED FORM HAVING COMPLETED DATA FIELDS THAT EACH CORRESPOND TO A RESPECTIVE DATA FIELD OF THE NEW FORM 306, process 300 for learning and incorporating new forms in an electronic document preparation system gathers training set data related to previously filled forms, each previously filled form having completed data fields that each correspond to a respective data field of the new form.

In one embodiment, once process 300 for learning and incorporating new forms in an electronic document preparation system gathers training set data related to previously filled forms, each previously filled form having completed data fields that each correspond to a respective data field of the new form at GATHER TRAINING SET DATA RELATED TO PREVIOUSLY FILLED FORMS, EACH PREVIOUSLY FILLED FORM HAVING COMPLETED DATA FIELDS THAT EACH CORRESPOND TO A RESPECTIVE DATA FIELD OF THE NEW FORM 306, process flow proceeds to GENERATE, FOR A FIRST SELECTED DATA FIELD OF THE PLURALITY OF DATA FIELDS OF THE NEW FORM, DEPENDENCY DATA INDICATING ONE OR MORE POSSIBLE DEPENDENCIES FOR A CORRECT FUNCTION THAT PROVIDES A PROPER DATA VALUE FOR THE FIRST SELECTED DATA FIELD 308.

In one embodiment, at GENERATE, FOR A FIRST SELECTED DATA FIELD OF THE PLURALITY OF DATA FIELDS OF THE NEW FORM, DEPENDENCY DATA INDICATING ONE OR MORE POSSIBLE DEPENDENCIES FOR A CORRECT FUNCTION THAT PROVIDES A PROPER DATA VALUE FOR THE FIRST SELECTED DATA FIELD 308, process 300 for learning and incorporating new forms in an electronic document preparation system generates, for a first selected data field of the plurality of data fields of the new form, dependency data indicating one or more possible dependencies for a correct function that provides a proper data value for the first selected data field.

In one embodiment, once process 300 for learning and incorporating new forms in an electronic document preparation system generates, for a first selected data field of the plurality of data fields of the new form, dependency data indicating one or more possible dependencies for a correct function that provides a proper data value for the first selected data field at GENERATE, FOR A FIRST SELECTED DATA FIELD OF THE PLURALITY OF DATA FIELDS OF THE NEW FORM, DEPENDENCY DATA INDICATING ONE OR MORE POSSIBLE DEPENDENCIES FOR A CORRECT FUNCTION THAT PROVIDES A PROPER DATA VALUE FOR THE FIRST SELECTED DATA FIELD 308, process flow proceeds to GENERATE, FOR THE FIRST SELECTED DATA FIELD, CANDIDATE FUNCTION DATA INCLUDING A PLURALITY OF CANDIDATE FUNCTIONS BASED ON THE DEPENDENCY DATA AND ONE OR MORE OPERATORS SELECTED FROM A SET OF OPERATORS 310.

In one embodiment, at GENERATE, FOR THE FIRST SELECTED DATA FIELD, CANDIDATE FUNCTION DATA INCLUDING A PLURALITY OF CANDIDATE FUNCTIONS BASED ON THE DEPENDENCY DATA AND ONE OR MORE OPERATORS SELECTED FROM A SET OF OPERATORS 310, process 300 for learning and incorporating new forms in an electronic document preparation system generates, for the first selected data field, candidate function data including a plurality of candidate functions based on the dependency data and one or more operators selected from a set of operators, according to one embodiment.

In one embodiment, once process 300 for learning and incorporating new forms in an electronic document preparation system generates, for the first selected data field, candidate function data including a plurality of candidate functions based on the dependency data and one or more operators selected from a set of operators at GENERATE, FOR THE FIRST SELECTED DATA FIELD, CANDIDATE FUNCTION DATA INCLUDING A PLURALITY OF CANDIDATE FUNCTIONS BASED ON THE DEPENDENCY DATA AND ONE OR MORE OPERATORS SELECTED FROM A SET OF OPERATORS 310, process flow proceeds to GENERATE, FOR EACH CANDIDATE FUNCTION, TEST DATA BY APPLYING THE CANDIDATE FUNCTION TO THE TRAINING SET DATA 312.

In one embodiment, at GENERATE, FOR EACH CANDIDATE FUNCTION, TEST DATA BY APPLYING THE CANDIDATE FUNCTION TO THE TRAINING SET DATA 312, the process 300 generates, for each candidate function, test data by applying the candidate function to the training set data.

In one embodiment, once process 300 generates, for each candidate function, test data by applying the candidate function to the training set data at GENERATE, FOR EACH CANDIDATE FUNCTION, TEST DATA BY APPLYING THE CANDIDATE FUNCTION TO THE TRAINING SET DATA 312, process flow proceeds to GENERATE, FOR EACH CANDIDATE FUNCTION, MATCHING DATA BY COMPARING THE TEST DATA TO THE COMPLETED DATA FIELDS CORRESPONDING TO THE FIRST SELECTED DATA FIELD, THE MATCHING DATA INDICATING HOW CLOSELY THE TEST DATA MATCHES THE CORRESPONDING COMPLETED DATA FIELDS OF THE PREVIOUSLY FILLED FORMS 314.

In one embodiment, at GENERATE, FOR EACH CANDIDATE FUNCTION, MATCHING DATA BY COMPARING THE TEST DATA TO THE COMPLETED DATA FIELDS CORRESPONDING TO THE FIRST SELECTED DATA FIELD, THE MATCHING DATA INDICATING HOW CLOSELY THE TEST DATA MATCHES THE CORRESPONDING COMPLETED DATA FIELDS OF THE PREVIOUSLY FILLED FORMS 314 the process 300 for learning and incorporating new forms in an electronic document preparation system generates, for each candidate function, matching data by comparing the test data to the completed data fields corresponding to the first selected data field, the matching data indicating how closely the test data matches the corresponding completed data fields of the previously filled forms.
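
The test data and matching data operations described above can be sketched, under stated assumptions, as follows. The example represents each previously filled form as a dictionary of field values and expresses matching data as the fraction of forms for which the candidate reproduces the completed value. The function names, the field keys, the sample values, and the use of a simple match ratio are all hypothetical. Other measures of closeness could be substituted.

```python
def generate_test_data(candidate, training_rows):
    """Apply one candidate function to every previously filled form."""
    return [candidate(row) for row in training_rows]


def generate_matching_data(test_values, completed_values, tolerance=0):
    """Return the fraction of forms where the test value matches the completed field."""
    if not completed_values or len(test_values) != len(completed_values):
        raise ValueError("test data and completed data must be non-empty and aligned")
    hits = sum(
        1 for t, c in zip(test_values, completed_values) if abs(t - c) <= tolerance
    )
    return hits / len(completed_values)


# Fabricated training set: line_5 is the selected data field.
training_rows = [
    {"line_3": 500, "line_4": 200, "line_5": 300},
    {"line_3": 900, "line_4": 150, "line_5": 750},
    {"line_3": 120, "line_4": 20, "line_5": 100},
    {"line_3": 400, "line_4": 100, "line_5": 310},  # simulated data-entry error
]


def candidate(row):
    # One candidate function under test: line 3 minus line 4.
    return row["line_3"] - row["line_4"]


test_values = generate_test_data(candidate, training_rows)
completed_values = [row["line_5"] for row in training_rows]
match_ratio = generate_matching_data(test_values, completed_values)
print(test_values, match_ratio)
```

Reporting a ratio rather than requiring an exact match on every form leaves room for training data that contains occasional errors, as in the last fabricated row.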

In one embodiment, once the process 300 for learning and incorporating new forms in an electronic document preparation system generates, for each candidate function, matching data by comparing the test data to the completed data fields corresponding to the first selected data field, the matching data indicating how closely the test data matches the corresponding completed data fields of the previously filled forms at GENERATE, FOR EACH CANDIDATE FUNCTION, MATCHING DATA BY COMPARING THE TEST DATA TO THE COMPLETED DATA FIELDS CORRESPONDING TO THE FIRST SELECTED DATA FIELD, THE MATCHING DATA INDICATING HOW CLOSELY THE TEST DATA MATCHES THE CORRESPONDING COMPLETED DATA FIELDS OF THE PREVIOUSLY FILLED FORMS 314, process flow proceeds to IDENTIFY, FROM THE PLURALITY OF CANDIDATE FUNCTIONS, A CORRECT CANDIDATE FUNCTION FOR THE FIRST DATA FIELD OF THE NEW FORM BY DETERMINING, FOR EACH CANDIDATE FUNCTION, WHETHER OR NOT THE CANDIDATE FUNCTION IS THE CORRECT FUNCTION FOR THE FIRST SELECTED DATA FIELD OF THE NEW FORM BASED ON THE MATCHING DATA 316.

In one embodiment, at IDENTIFY, FROM THE PLURALITY OF CANDIDATE FUNCTIONS, A CORRECT CANDIDATE FUNCTION FOR THE FIRST DATA FIELD OF THE NEW FORM BY DETERMINING, FOR EACH CANDIDATE FUNCTION, WHETHER OR NOT THE CANDIDATE FUNCTION IS THE CORRECT FUNCTION FOR THE FIRST SELECTED DATA FIELD OF THE NEW FORM BASED ON THE MATCHING DATA 316 the process 300 for learning and incorporating new forms in an electronic document preparation system identifies, from the plurality of candidate functions, a correct candidate function for the first data field of the new form by determining, for each candidate function, whether or not the candidate function is the correct function for the first selected data field of the new form based on the matching data.
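
One hedged way to picture the identification operation is a threshold test over the matching data. In the sketch below, the matching data for each candidate is assumed to be a ratio between 0 and 1, and a candidate is accepted as the correct function only if it has the highest ratio and that ratio meets a threshold. The function name identify_correct_function, the 0.95 default, and the sample scores are illustrative assumptions and are not values taken from this disclosure.

```python
def identify_correct_function(matching_data, threshold=0.95):
    """Return the label of the best candidate, or None if none meets the threshold.

    matching_data maps a candidate label to its match ratio (0.0 to 1.0).
    """
    if not matching_data:
        return None
    best_label = max(matching_data, key=matching_data.get)
    if matching_data[best_label] >= threshold:
        return best_label
    return None


# Hypothetical matching data for three candidate functions.
matching_data = {
    "line_3 + line_4": 0.02,
    "line_3 - line_4": 0.98,
    "line_4 - line_3": 0.00,
}
correct = identify_correct_function(matching_data)
print(correct)
```

Returning None when no candidate meets the threshold gives the caller a clear signal that further candidate functions need to be generated and tested.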

In one embodiment, once the process 300 for learning and incorporating new forms in an electronic document preparation system identifies, from the plurality of candidate functions, a correct candidate function for the first data field of the new form by determining, for each candidate function, whether or not the candidate function is the correct function for the first selected data field of the new form based on the matching data at IDENTIFY, FROM THE PLURALITY OF CANDIDATE FUNCTIONS, A CORRECT CANDIDATE FUNCTION FOR THE FIRST DATA FIELD OF THE NEW FORM BY DETERMINING, FOR EACH CANDIDATE FUNCTION, WHETHER OR NOT THE CANDIDATE FUNCTION IS THE CORRECT FUNCTION FOR THE FIRST SELECTED DATA FIELD OF THE NEW FORM BASED ON THE MATCHING DATA 316, process flow proceeds to GENERATE, AFTER IDENTIFYING THE CORRECT FUNCTION FOR THE FIRST DATA FIELD, RESULTS DATA INDICATING THE CORRECT FUNCTION FOR THE FIRST SELECTED DATA FIELD OF THE NEW FORM 318.

In one embodiment, at GENERATE, AFTER IDENTIFYING THE CORRECT FUNCTION FOR THE FIRST SELECTED DATA FIELD, RESULTS DATA INDICATING THE CORRECT FUNCTION FOR THE FIRST DATA FIELD OF THE NEW FORM 318, the process 300 for learning and incorporating new forms in an electronic document preparation system generates, after identifying the correct function for the first data field, results data indicating the correct function for the first selected data field of the new form.

In one embodiment, once the process 300 for learning and incorporating new forms in an electronic document preparation system generates, after identifying the correct function for the first selected data field, results data indicating the correct function for the first data field of the new form at GENERATE, AFTER IDENTIFYING THE CORRECT FUNCTION FOR THE FIRST SELECTED DATA FIELD, RESULTS DATA INDICATING THE CORRECT FUNCTION FOR THE FIRST DATA FIELD OF THE NEW FORM 318, process flow proceeds to OUTPUT THE RESULTS DATA 320.

In one embodiment, at OUTPUT THE RESULTS DATA 320 the process 300 for learning and incorporating new forms in an electronic document preparation system outputs the results data.
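
For illustration, results data could be packaged as a small record and serialized for output. The sketch below is one possible shape only. The field names, the "pending_expert_review" status, and the choice of JSON are assumptions made here and are not specified by this disclosure.

```python
import json


def build_results_data(field_name, correct_function_label, match_ratio):
    """Assemble results data for one data field of the new form."""
    return {
        "field": field_name,
        "correct_function": correct_function_label,
        "match_ratio": match_ratio,
        # Hypothetical status flag for a downstream review step.
        "status": "pending_expert_review",
    }


def output_results_data(results_data):
    """Serialize results data to a JSON string."""
    return json.dumps(results_data, indent=2, sort_keys=True)


results_data = build_results_data("line_5", "line_3 - line_4", 0.98)
serialized = output_results_data(results_data)
print(serialized)
```

A plain serialized record of this kind can be displayed to a reviewer or passed to another module without either side depending on the internals of the learning step.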

In one embodiment, once the process 300 for learning and incorporating new forms in an electronic document preparation system outputs the results data at OUTPUT THE RESULTS DATA 320, process flow proceeds to END 322.

In one embodiment, at END 322 the process for learning and incorporating new forms in an electronic document preparation system is exited to await new data and/or instructions.

As noted above, the specific examples discussed above are merely illustrative implementations of embodiments of the method or process for learning and incorporating new forms in an electronic document preparation system. Those of skill in the art will readily recognize that other implementations and embodiments are possible. Therefore, the discussion above should not be construed as a limitation on the claims provided below.

In one embodiment, a computing system implements a method for learning and incorporating new forms in an electronic document preparation system. The method includes receiving form data related to a new form having a plurality of data fields and gathering training set data related to previously filled forms. Each previously filled form has completed data fields that each correspond to a respective data field of the new form. The method also includes generating, for a first selected data field from the plurality of data fields of the new form, candidate function data including a plurality of candidate input functions for providing a proper data value for the first selected data field, generating, for each candidate function, test data by applying the candidate function to the training set data, and generating, for each candidate function, matching data by comparing the test data to the completed data fields corresponding to the first selected data field. The matching data indicates how closely the test data matches the corresponding completed data fields of the previously filled forms. The method also includes identifying, from the plurality of functions, a correct candidate function for the first data field of the new form by determining, for each candidate function, whether or not the candidate function is the correct function for the first selected data field of the new form based on the matching data. The method also includes generating, after identifying the correct function for the first data field, results data indicating the correct function for the first data field of the new form and outputting the results data.

In one embodiment, a non-transitory computer-readable medium has a plurality of computer-executable instructions which, when executed by a processor, perform a method for learning and incorporating new forms in an electronic document preparation system. The instructions include an interface module configured to receive form data representing a new form having a plurality of data fields and a data acquisition module configured to gather training set data related to previously filled forms. Each previously filled form has completed data fields that each correspond to a respective data field of the new form. The instructions also include a machine learning module configured to identify a respective correct function for each of the data fields of the new form by generating candidate function data relating to a plurality of candidate functions, generating test data by applying the candidate functions to the training set data, and finding, for each of the data fields, a respective correct function from the plurality of candidate functions based on how closely the test data matches the candidate function data.

One embodiment is a system for learning and incorporating new forms in an electronic document preparation system. The system includes at least one processor and at least one memory coupled to the at least one processor, the at least one memory having stored therein instructions which, when executed by any set of the one or more processors, perform a process. The process includes receiving, with an interface module of a computing system, form data related to a new form having a plurality of data fields and gathering training set data related to previously filled forms. Each previously filled form has completed data fields that each correspond to a respective data field of the new form. The process also includes generating, with a data acquisition module of a computing system, for a first selected data field from the plurality of data fields of the new form, candidate function data including a plurality of candidate input functions for providing a proper data value for the first selected data field. The process also includes generating, with a machine learning module of a computing system, for each candidate function, test data by applying the candidate function to the training set data and generating, for each candidate function, matching data by comparing the test data to the completed data fields corresponding to the first selected data field. The matching data indicates how closely the test data matches the corresponding completed data fields of the previously filled forms. The process also includes identifying, with the machine learning module, from the plurality of functions, a correct candidate function for the first data field of the new form by determining, for each candidate function, whether or not the candidate function is the correct function for the first selected data field of the new form based on the matching data. The process also includes generating, with the machine learning module, after identifying the correct function for the first data field, results data indicating the correct function for the first data field of the new form and outputting, with the interface module, the results data.

One embodiment is a computing system implemented method for learning and incorporating new forms in an electronic document preparation system. The method includes receiving form data related to a new form having a plurality of data fields and gathering training set data related to previously filled forms. Each previously filled form has completed data fields that each correspond to a respective data field of the new form. The method also includes generating, for a first selected data field of the plurality of data fields of the new form, dependency data indicating one or more possible dependencies for a correct function that provides a proper data value for the first selected data field. The method further includes generating, for the first selected data field, candidate function data including a plurality of candidate functions based on the dependency data and one or more operators selected from a set of operators, generating, for each candidate function, test data by applying the candidate function to the training set data, and generating, for each candidate function, matching data by comparing the test data to the completed data fields corresponding to the first selected data field, the matching data indicating how closely the test data matches the corresponding completed data fields of the previously filled forms. The method also includes identifying, from the plurality of functions, the correct candidate function for the first selected data field of the new form by determining, for each candidate function, whether or not the candidate function is the correct function for the first selected data field of the new form based on the matching data, generating, after identifying the correct function for the first data field, results data indicating the correct function for the first data field of the new form, and outputting the results data.

One embodiment is a non-transitory computer-readable medium having a plurality of computer-executable instructions which, when executed by a processor, perform a method for learning and incorporating new forms in an electronic document preparation system. The instructions include an interface module configured to receive form data representing a new form having a plurality of data fields. The instructions include a data acquisition module configured to gather training set data related to previously filled forms. Each previously filled form has completed data fields that each correspond to a respective data field of the new form. The instructions also include a machine learning module configured to identify a respective correct function for each of the data fields of the new form by generating candidate function data relating to a plurality of candidate functions based on dependency data indicating possible dependencies for each data field of the new form and including one or more operators from a set of operators, generating test data by applying the candidate functions to the training set data, and finding, for each of the data fields, a respective correct function from the plurality of candidate functions based on how closely the test data matches the candidate function data.

One embodiment is a system for learning and incorporating new forms in an electronic document preparation system. The system includes at least one processor and at least one memory coupled to the at least one processor. The at least one memory has stored therein instructions which, when executed by any set of the one or more processors, perform a process. The process includes receiving, with an interface module of a computing system, form data related to a new form having a plurality of data fields and gathering, with a data acquisition module of a computing system, training set data related to previously filled forms. Each previously filled form has completed data fields that each correspond to a respective data field of the new form. The process also includes generating, with a machine learning module of a computing system, for a first selected data field of the plurality of data fields of the new form, dependency data indicating one or more possible dependencies for a correct function that provides a proper data value for the first selected data field. The process also includes generating, with the machine learning module, for the first selected data field, candidate function data including a plurality of candidate functions based on the dependency data and one or more operators selected from a set of operators, generating, with the machine learning module, for each candidate function, test data by applying the candidate function to the training set data, and generating, with the machine learning module, for each candidate function, matching data by comparing the test data to the completed data fields corresponding to the first selected data field, the matching data indicating how closely the test data matches the corresponding completed data fields of the previously filled forms. The process also includes identifying, with the machine learning module, from the plurality of functions, the correct candidate function for the first selected data field of the new form by determining, for each candidate function, whether or not the candidate function is the correct function for the first selected data field of the new form based on the matching data, generating, with the machine learning module and after identifying the correct function for the first data field, results data indicating the correct function for the first data field of the new form, and outputting, with the interface module, the results data.

In the discussion above, certain aspects of one embodiment include process steps and/or operations and/or instructions described herein for illustrative purposes in a particular order and/or grouping. However, the particular order and/or grouping shown and discussed herein are illustrative only and not limiting. Those of skill in the art will recognize that other orders and/or grouping of the process steps and/or operations and/or instructions are possible and, in some embodiments, one or more of the process steps and/or operations and/or instructions discussed above can be combined and/or deleted. In addition, portions of one or more of the process steps and/or operations and/or instructions can be re-grouped as portions of one or more other of the process steps and/or operations and/or instructions discussed herein. Consequently, the particular order and/or grouping of the process steps and/or operations and/or instructions discussed herein do not limit the scope of the invention as claimed below.

As discussed in more detail above, the above embodiments provide, with little or no modification and/or input, considerable flexibility, adaptability, and opportunity for customization to meet the specific needs of various parties under numerous circumstances.

The present invention has been described in particular detail with respect to specific possible embodiments. Those of skill in the art will appreciate that the invention may be practiced in other embodiments. For example, the nomenclature used for components, capitalization of component designations and terms, the attributes, data structures, or any other programming or structural aspect is not significant, mandatory, or limiting, and the mechanisms that implement the invention or its features can have various different names, formats, or protocols. Further, the system or functionality of the invention may be implemented via various combinations of software and hardware, as described, or entirely in hardware elements. Also, particular divisions of functionality between the various components described herein are merely exemplary, and not mandatory or significant. Consequently, functions performed by a single component may, in other embodiments, be performed by multiple components, and functions performed by multiple components may, in other embodiments, be performed by a single component.

Some portions of the above description present the features of the present invention in terms of algorithms and symbolic representations, or algorithm-like representations, of operations on information/data. These algorithmic or algorithm-like descriptions and representations are the means used by those of skill in the art to most effectively and efficiently convey the substance of their work to others of skill in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs or computing systems. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as steps or modules or by functional names, without loss of generality.

Unless specifically stated otherwise, as would be apparent from the above discussion, it is appreciated that throughout the above description, discussions utilizing terms such as, but not limited to, “activating”, “accessing”, “adding”, “aggregating”, “alerting”, “applying”, “analyzing”, “associating”, “calculating”, “capturing”, “categorizing”, “classifying”, “comparing”, “creating”, “defining”, “detecting”, “determining”, “distributing”, “eliminating”, “encrypting”, “extracting”, “filtering”, “forwarding”, “generating”, “identifying”, “implementing”, “informing”, “monitoring”, “obtaining”, “posting”, “processing”, “providing”, “receiving”, “requesting”, “saving”, “sending”, “storing”, “substituting”, “transferring”, “transforming”, “transmitting”, “using”, etc., refer to the action and process of a computing system or similar electronic device that manipulates and operates on data represented as physical (electronic) quantities within the computing system memories, registers, caches or other information storage, transmission or display devices.

The present invention also relates to an apparatus or system for performing the operations described herein. This apparatus or system may be specifically constructed for the required purposes, or the apparatus or system can comprise a general purpose system selectively activated or configured/reconfigured by a computer program stored on a computer program product as discussed herein that can be accessed by a computing system or other device.

Those of skill in the art will readily recognize that the algorithms and operations presented herein are not inherently related to any particular computing system, computer architecture, computer or industry standard, or any other specific apparatus. Various general purpose systems may also be used with programs in accordance with the teaching herein, or it may prove more convenient/efficient to construct more specialized apparatuses to perform the required operations described herein. The required structure for a variety of these systems will be apparent to those of skill in the art, along with equivalent variations. In addition, the present invention is not described with reference to any particular programming language and it is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any references to a specific language or languages are provided for illustrative purposes only and for enablement of the contemplated best mode of the invention at the time of filing.

The present invention is well suited to a wide variety of computer network systems operating over numerous topologies. Within this field, the configuration and management of large networks comprise storage devices and computers that are communicatively coupled to similar or dissimilar computers and storage devices over a private network, a LAN, a WAN, or a public network, such as the Internet.

It should also be noted that the language used in the specification has been principally selected for readability, clarity and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the claims below.

In addition, the operations shown in the FIG.s, or as discussed herein, are identified using a particular nomenclature for ease of description and understanding, but other nomenclature is often used in the art to identify equivalent operations.

Therefore, numerous variations, whether explicitly provided for by the specification or implied by the specification or not, may be implemented by one of skill in the art in view of this disclosure. 

What is claimed is:
 1. A computing system implemented method for learning and incorporating new forms in an electronic document preparation system, the method comprising: receiving form data related to a new form having a plurality of data fields; gathering training set data related to previously filled forms, each previously filled form having completed data fields that each correspond to a respective data field of the new form; generating, for a first selected data field of the plurality of data fields of the new form, dependency data indicating one or more possible dependencies for a correct function that provides a proper data value for the first selected data field; generating, for the first selected data field, candidate function data including a plurality of candidate functions based on the dependency data and one or more operators selected from a set of operators; generating, for each candidate function, test data by applying the candidate function to the training set data; generating, for each candidate function, matching data by comparing the test data to the completed data fields corresponding to the first selected data field, the matching data indicating how closely the test data matches the corresponding completed data fields of the previously filled forms; identifying, from the plurality of functions, the correct candidate function for the first selected data field of the new form by determining, for each candidate function, whether or not the candidate function is the correct function for the first selected data field of the new form based on the matching data; generating, after identifying the correct function for the first data field, results data indicating the correct function for the first data field of the new form; and outputting the results data.
 2. The method of claim 1, wherein outputting the results data includes outputting the results data for review by an expert.
 3. The method of claim 2, further comprising prompting an expert to approve the correct function for the first data field of the new form.
 4. The method of claim 1, further comprising, after identifying the correct function for the first selected data field of the new form, identifying a second correct function for a second selected data field from the plurality of data fields of the new form.
 5. The method of claim 4, wherein identifying the second correct function for the second selected data field of the new form includes: generating, for the second selected data field, second dependency data indicating one or more possible second dependencies for the second correct function; generating, for the second selected data field, second candidate function data including a plurality of second candidate functions based on the second dependency data and one or more operators selected from a set of operators; generating, for each second candidate function, second test data by applying the second candidate function to the second training set data; generating, for each candidate function, second matching data by comparing the second test data to the completed data fields corresponding to the second selected data field, the second matching data indicating how closely the second test data matches the corresponding completed data fields of the previously filled forms; identifying, from the plurality of second candidate functions, the second correct candidate function for the first selected data field of the new form by determining, for each second candidate function, whether or not the second candidate function is the second correct function for the second selected data field of the new form based on the second matching data; generating, after identifying the second correct function for the second selected data field, second results data indicating the second correct function for the second data field of the new form; and outputting the second results data.
 6. The method of claim 1, further comprising generating, for each candidate function, confidence score data based on the matching data and indicating a level of confidence that the candidate function is the correct function for the first data field.
 7. The method of claim 1, wherein the training set data includes historical financial data related to previously prepared financial documents, the historical financial data including the previously filled forms.
 8. The method of claim 7, wherein the historical financial data includes previously prepared financial documents that were previously filed with a government or financial institution.
 9. The method of claim 1, wherein the training set data includes fabricated financial data related to fabricated financial documents, the fabricated financial data including the previously filled forms.
 10. The method of claim 9, further comprising receiving the fabricated financial data from one or more third parties.
 11. The method of claim 1, further comprising generating historical document instructions data related to software instructions for completing historical forms.
 12. The method of claim 11, further comprising analyzing the historical document instructions data.
 13. The method of claim 12, wherein generating the candidate function data includes generating the dependency data based on the historical document instructions data.
 14. The method of claim 1, further comprising generating natural language parsing data by performing natural language parsing analysis on the form data.
 15. The method of claim 11, wherein generating the dependency data includes generating the dependency data based on the natural language parsing data.
 16. The method of claim 1, wherein the set of operators includes one or more of: an addition operator; a subtraction operator; a division operator; a multiplication operator; an exponential operator; logical operators; a string comparison operator; and existence condition operators.
 17. The method of claim 1, wherein the candidate functions generate numerical data values.
 18. The method of claim 1, wherein the possible dependencies include one or more of: a data field from the new form; multiple data fields from the new form; a data field from a form other than a new form; multiple data fields from multiple forms other than the new form; and a constant.
 19. The method of claim 1, wherein the new form is a new tax form.
 20. The method of claim 19, wherein the training set data includes previously prepared tax returns.
 21. The method of claim 20, wherein the training set data includes fabricated tax returns.
 22. A non-transitory computer-readable medium having a plurality of computer-executable instructions which, when executed by a processor, perform a method for learning and incorporating new forms in an electronic document preparation system, the instructions comprising: an interface module configured to receive form data related to a new form having a plurality of data fields; a data acquisition module configured to gather training set data related to previously filled forms, each previously filled form having completed data fields that each correspond to a respective data field of the new form; and a machine learning module configured to identify a respective correct function for each of the data fields of the new form by generating candidate function data relating to a plurality of candidate functions based on dependency data indicating possible dependencies for each data field of the new form and including one or more operators from a set of operators, generating test data by applying the candidate functions to the training set data, and finding, for each of the data fields, a respective correct function from the plurality of candidate functions based on how closely the test data matches the candidate function data.
 23. The non-transitory computer-readable medium of claim 22 wherein the machine learning module is configured to continue generating candidate functions for each data field of the new form until the corresponding correct function is found.
 24. The non-transitory computer-readable medium of claim 22 wherein the machine learning module is configured to generate results data indicating the respective correct function for each data field of the new form.
 25. The non-transitory computer-readable medium of claim 22 wherein the electronic document preparation system includes a financial document preparation system.
 26. The non-transitory computer-readable medium of claim 22 wherein the financial document preparation system includes a tax return preparation system.
 27. A system for learning and incorporating new forms in an electronic document preparation system, the system comprising: at least one processor; and at least one memory coupled to the at least one processor, the at least one memory having stored therein instructions which, when executed by any set of the one or more processors, perform a process including: receiving, with an interface module of a computing system, form data related to a new form having a plurality of data fields; gathering, with a data acquisition module of a computing system, training set data related to previously filled forms, each previously filled form having completed data fields that each correspond to a respective data field of the new form; generating, with a machine learning module of a computing system, for a first selected data field of the plurality of data fields of the new form, dependency data indicating one or more possible dependencies for a correct function that provides a proper data value for the first selected data field; generating, with the machine learning module, for the first selected data field, candidate function data including a plurality of candidate functions based on the dependency data and one or more operators selected from a set of operators; generating, with the machine learning module, for each candidate function, test data by applying the candidate function to the training set data; generating, with the machine learning module, for each candidate function, matching data by comparing the test data to the completed data fields corresponding to the first selected data field, the matching data indicating how closely the test data matches the corresponding completed data fields of the previously filled forms; identifying, with the machine learning module, from the plurality of functions, the correct candidate function for the first selected data field of the new form by determining, for each candidate function, whether or not the candidate function is the correct function for the first selected data field of the new form based on the matching data; generating, with the machine learning module and after identifying the correct function for the first data field, results data indicating the correct function for the first data field of the new form; and outputting, with the interface module, the results data.
 28. The system of claim 27, wherein outputting the results data includes outputting the results data for review by an expert.
 29. The system of claim 28, further comprising prompting an expert to approve the correct function for the first data field of the new form.
 30. The system of claim 27, wherein the process further includes, after identifying the correct function for the first selected data field of the new form, identifying a second correct function for a second selected data field from the plurality of data fields of the new form.
 31. The system of claim 30, wherein identifying the second correct function for the second selected data field of the new form includes: generating, for the second selected data field, second dependency data indicating one or more possible second dependencies for the second correct function; generating, for the second selected data field, second candidate function data including a plurality of second candidate functions based on the second dependency data and one or more operators selected from a set of operators; generating, for each second candidate function, second test data by applying the second candidate function to the second training set data; generating, for each second candidate function, second matching data by comparing the second test data to the completed data fields corresponding to the second selected data field, the second matching data indicating how closely the second test data matches the corresponding completed data fields of the previously filled forms; identifying, from the plurality of second candidate functions, the second correct candidate function for the second selected data field of the new form by determining, for each second candidate function, whether or not the second candidate function is the second correct function for the second selected data field of the new form based on the second matching data; generating, after identifying the second correct function for the second selected data field, second results data indicating the second correct function for the second data field of the new form; and outputting the second results data.
 32. The system of claim 27, wherein the process includes generating, for each candidate function, confidence score data based on the matching data and indicating a level of confidence that the candidate function is the correct function for the first data field.
 33. The system of claim 27, wherein the training set data includes historical financial data related to previously prepared financial documents, the historical financial data including the previously filled forms.
 34. The system of claim 33, wherein receiving the form data includes receiving the form data automatically from a third party.
 35. The system of claim 27, wherein the training set data includes fabricated financial data related to fabricated financial documents, the fabricated financial data including the previously filled forms.
 36. The system of claim 35, wherein the process includes receiving the fabricated financial data from one or more third parties.
 37. The system of claim 27, wherein the process includes generating historical document instructions data related to software instructions for completing historical forms.
 38. The system of claim 37, wherein the process includes generating the dependency data by analyzing the historical document instructions data.
 39. The system of claim 38, wherein the process includes generating the candidate function data based on the historical document instructions data.
 40. The system of claim 27, wherein the process includes generating natural language parsing data by performing natural language parsing analysis on the form data.
 41. The system of claim 37, wherein the process includes generating the dependency data based on the natural language parsing data.
 42. The system of claim 27, wherein the set of operators includes one or more of: an addition operator; a subtraction operator; a division operator; a multiplication operator; an exponential operator; logical operators; a string comparison operator; and existence condition operators.
 43. The system of claim 27, wherein the candidate functions generate numerical data values.
 44. The system of claim 27, wherein the possible dependencies include one or more of: a data field from the new form; multiple data fields from the new form; a data field from a form other than the new form; multiple data fields from multiple forms other than the new form; and a constant.
 45. The system of claim 27, wherein the new form is a new tax form.
 46. The system of claim 45, wherein the training set data includes previously prepared tax returns.
 47. The system of claim 45, wherein the process includes updating a tax return preparation system based on the correct function.
 48. The system of claim 47, wherein the process further includes updating a tax return preparation interview script to include one or more prompts, notifications, or explanations to a user based on the correct function. 
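
By way of illustration only, the process recited in claim 27 (dependency data, candidate functions built from a set of operators, test data, matching data, and selection of the correct function) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the claimed implementation: the field names (line1, line2, line3), the helper names, the restriction to two-operand candidates, the three-operator set, and the exact-match scoring rule with a threshold are all hypothetical choices made for the example.

```python
import itertools
import operator

# Hypothetical, reduced operator set; the claims recite a broader set.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_candidates(dependencies):
    """Yield (label, function) pairs combining two dependencies with one operator."""
    for a, b in itertools.permutations(dependencies, 2):
        for symbol, op in OPERATORS.items():
            # Default arguments bind the current a, b, op to each lambda.
            yield (f"{a} {symbol} {b}",
                   lambda form, a=a, b=b, op=op: op(form[a], form[b]))


def matching_score(func, target, training_set):
    """Fraction of previously filled forms whose target field equals the test value."""
    hits = sum(1 for form in training_set if func(form) == form[target])
    return hits / len(training_set)


def find_correct_function(target, dependencies, training_set, threshold=1.0):
    """Return (label, score) of the best candidate, or (None, score) below threshold."""
    best_label, best_score = None, 0.0
    for label, func in generate_candidates(dependencies):
        score = matching_score(func, target, training_set)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


# Fabricated training set: each dict stands in for one previously filled form.
TRAINING_SET = [
    {"line1": 100, "line2": 40, "line3": 60},
    {"line1": 250, "line2": 50, "line3": 200},
    {"line1": 80, "line2": 80, "line3": 0},
]

result = find_correct_function("line3", ["line1", "line2"], TRAINING_SET)
print(result)  # ('line1 - line2', 1.0)
```

In this sketch the matching data is reduced to a single fraction of exact matches, and integer values are used so that equality comparison is reliable; a confidence score of the kind recited in claim 32 could be derived from the same fraction.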